Marketing Campaign Analysis¶
Problem Definition¶
Customer segmentation is the process of dividing a customer dataset into groups of similar customers based on common characteristics, usually to understand the population better. Understanding customer behavior and characteristics is a critical part of the marketing operations of any business or organization, with direct consequences for sales and marketing strategy. Customer segmentation is often viewed as a means to achieve a better return on investment from marketing efforts, and to make organizations more efficient with their money, time, and other critical resources by crafting custom marketing strategies for different groups of customers based on their unique needs and motivations.

For example, research has repeatedly shown that customer segmentation has a large impact on email engagement: segmented campaigns often see over 100% more clicks than non-segmented campaigns, and email marketers who segmented their audience before campaigning have reported 6-7 times growth in overall revenue. It has also been observed across contexts that individual customers increasingly prefer personalized communications and offerings that cater to their particular interests.

In the context of marketing analytics, then, customer segmentation has a vital role to play in optimizing ROI. It typically involves analyzing metrics around customer engagement with various marketing activities, including but not limited to ATL (above the line) marketing activities, BTL (below the line) campaigns, and personalized offers. The variables of interest are typically customer profiles, campaign conversion rates, and information associated with various marketing channels. Based on these feature categories, the target is to create the best possible customer segments from the given data.
The Context:¶
- Why is this problem important to solve?
Enhanced Marketing Efficiency:
- By understanding customer groups, businesses can design targeted marketing campaigns that resonate more effectively with the preferences of each segment. This ensures optimal allocation of resources like time, budget, and marketing effort.
Improved ROI:
- Segmentation allows for personalized communication and offers, leading to higher conversion rates, better customer engagement, and significant increases in revenue. Segmented email campaigns, for instance, have demonstrated over 100% more clicks and 6-7 times growth in revenue compared to non-segmented campaigns.
Customer Satisfaction and Retention:
- Tailored marketing strategies align closely with the individual needs and motivations of customers, increasing their satisfaction and likelihood of staying loyal to the brand.
Strategic Insights for Decision-Making:
- Understanding distinct customer profiles and their engagement behaviors provides actionable insights that can inform broader business strategies, including product development, pricing, and customer service enhancements.
Competitive Advantage:
- In today's personalized market landscape, businesses that effectively segment their customers can differentiate themselves by offering bespoke solutions that competitors may not.
Optimal Use of Marketing Channels:
- Analyzing customer behavior across different marketing channels allows businesses to identify the most effective channels for each segment, thereby optimizing marketing spend.
The objective:¶
- What is the intended goal?
The primary goal of customer segmentation is to enable businesses to better understand their customer base and optimize their marketing strategies. This involves:
Personalized Marketing:
- Delivering tailored communications and offers to customers based on their unique characteristics and preferences.
Enhanced Engagement:
- Improving customer interaction with marketing campaigns by targeting the right audience with the right message.
Increased Revenue:
- Boosting conversion rates and overall revenue by identifying and focusing on the most valuable customer segments.
Resource Optimization:
- Efficiently allocating marketing resources, such as budget, time, and effort, toward campaigns and strategies that yield the highest returns.
Strategic Business Decisions:
- Utilizing insights derived from segmentation to inform broader business strategies, such as product development, channel optimization, and customer retention efforts.
By achieving these objectives, customer segmentation empowers businesses to build stronger relationships with their customers, enhance their competitive edge, and achieve sustainable growth.
The key questions:¶
- What are the key questions that need to be answered?
Who Are the Customers?
- What are the demographic, geographic, and behavioral characteristics of the customer base?
What Are Their Preferences?
- What products, services, or features do different customer groups prefer?
- How do their preferences vary across segments?
How Do Customers Interact with the Brand?
- What are the key engagement patterns across various marketing channels?
- Which channels are most effective for specific customer segments?
What Drives Customer Decisions?
- What are the key factors influencing customer purchase decisions?
- How do motivations and needs differ among customer groups?
Which Segments Are Most Valuable?
- Which customer groups contribute the most to revenue or profit?
- What is the lifetime value of different segments?
How Can Segments Be Targeted Effectively?
- What are the best strategies to engage each customer group?
- Which personalized offers or campaigns would resonate most with specific segments?
What Is the Impact of Marketing Campaigns?
- How do segmented campaigns compare with non-segmented campaigns in terms of performance?
- What are the conversion rates and ROI for each customer segment?
How Can Resource Allocation Be Optimized?
- How should marketing resources be distributed to maximize effectiveness across segments?
By addressing these questions, businesses can derive actionable insights to develop more effective marketing strategies and achieve higher ROI.
The problem formulation:¶
- What is it that we are trying to solve using data science?
The goal of applying data science to customer segmentation is to solve several critical challenges:
Identify Meaningful Customer Groups:
- Use data-driven techniques to uncover distinct customer segments based on shared characteristics, preferences, or behaviors.
Understand Customer Behavior:
- Analyze patterns and trends in how customers interact with the business, products, and marketing campaigns.
Personalize Marketing Strategies:
- Develop targeted campaigns and offerings tailored to each segment’s unique needs, maximizing engagement and satisfaction.
Optimize Resource Allocation:
- Determine where to focus marketing efforts and budget for maximum return on investment.
Enhance Customer Retention and Loyalty:
- Identify factors influencing customer loyalty and take proactive steps to reduce churn by addressing segment-specific needs.
Increase Revenue and Conversion Rates:
- Leverage segmentation insights to improve campaign effectiveness and boost revenue by focusing on high-value customer groups.
Predict Future Trends:
- Utilize predictive analytics to forecast customer behavior, such as purchasing patterns, to plan more effective strategies.
Measure and Improve ROI:
- Evaluate the performance of segmented campaigns and refine strategies based on data insights.
Through these efforts, data science enables businesses to make informed, data-backed decisions, leading to more effective marketing, better customer satisfaction, and sustainable growth.
Data Dictionary¶
The dataset contains the following features:
- ID: Unique ID of each customer
- Year_Birth: Customer’s year of birth
- Education: Customer's level of education
- Marital_Status: Customer's marital status
- Kidhome: Number of small children in customer's household
- Teenhome: Number of teenagers in customer's household
- Income: Customer's yearly household income in USD
- Recency: Number of days since the last purchase
- Dt_Customer: Date of customer's enrollment with the company
- MntFishProducts: The amount spent on fish products in the last 2 years
- MntMeatProducts: The amount spent on meat products in the last 2 years
- MntFruits: The amount spent on fruits products in the last 2 years
- MntSweetProducts: Amount spent on sweet products in the last 2 years
- MntWines: The amount spent on wine products in the last 2 years
- MntGoldProds: The amount spent on gold products in the last 2 years
- NumDealsPurchases: Number of purchases made with discount
- NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
- NumStorePurchases: Number of purchases made directly in stores
- NumWebPurchases: Number of purchases made through the company's website
- NumWebVisitsMonth: Number of visits to the company's website in the last month
- AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- Complain: 1 If the customer complained in the last 2 years, 0 otherwise
Note: You can assume that the data is collected in the year 2016.
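Since the data is assumed to be collected in 2016, a customer's age can be derived from Year_Birth. A minimal sketch of that derivation on a toy frame (the Age and Age_Suspect column names are illustrative choices, not part of the dataset):

```python
import pandas as pd

# Toy illustration: derive age from Year_Birth, assuming the note above
# (data collected in 2016). The frame below is made up for demonstration.
COLLECTION_YEAR = 2016

df = pd.DataFrame({"Year_Birth": [1957, 1984, 1893]})
df["Age"] = COLLECTION_YEAR - df["Year_Birth"]

# Flag implausible ages (e.g. over 100) for review rather than silently keeping them
df["Age_Suspect"] = df["Age"] > 100
print(df)
```

This also previews the data-quality check suggested later for Year_Birth: the 1893 birth year implies an age of 123, which would warrant inspection before clustering.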
Import the necessary libraries and load the data¶
!pip install scikit-learn-extra
import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans,DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from datetime import datetime
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet
from scipy.spatial.distance import pdist
import warnings
warnings.filterwarnings('ignore')
Requirement already satisfied: scikit-learn-extra in /usr/local/lib/python3.10/dist-packages (0.3.0)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.26.4)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.13.1)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.5.2)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (3.5.0)
Data Overview¶
- Reading the dataset
- Understanding the shape of the dataset
- Checking the data types
- Checking for missing values
- Checking for duplicated values
from google.colab import drive
drive.mount('/content/drive')
# Load the dataset
try:
data = pd.read_csv('/content/sample_data/marketing_campaign.csv')
print("Dataset loaded successfully.")
except FileNotFoundError:
print("Error: File not found. Please check the file path.")
except Exception as e:
print(f"An error occurred: {e}")
# Display the first few rows of the dataset
print(data.head())
# Print the shape of the dataset
print("\nDataset Shape:", data.shape)
# Print the data types of each column
print("\nData Types:\n", data.dtypes)
# Print the number of missing values in each column
print("\nMissing Values:\n", data.isnull().sum())
# Print the number of duplicated rows
print("\nNumber of duplicated rows:", data.duplicated().sum())
# Drop unrequired columns for Market Campaign Analysis
data = data.drop(columns=['ID'])
# General Information
print(data.info())
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset loaded successfully.
ID Year_Birth Education Marital_Status Income Kidhome Teenhome \
0 5524 1957 Graduation Single 58138.0 0 0
1 2174 1954 Graduation Single 46344.0 1 1
2 4141 1965 Graduation Together 71613.0 0 0
3 6182 1984 Graduation Together 26646.0 1 0
4 5324 1981 PhD Married 58293.0 1 0
Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases \
0 04-09-2012 58 635 ... 10 4
1 08-03-2014 38 11 ... 1 2
2 21-08-2013 26 426 ... 2 10
3 10-02-2014 26 11 ... 0 4
4 19-01-2014 94 173 ... 3 6
NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 \
0 7 0 0 0 0
1 5 0 0 0 0
2 4 0 0 0 0
3 6 0 0 0 0
4 5 0 0 0 0
AcceptedCmp2 Complain Response
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
[5 rows x 27 columns]
Dataset Shape: (2240, 27)
Data Types:
ID int64
Year_Birth int64
Education object
Marital_Status object
Income float64
Kidhome int64
Teenhome int64
Dt_Customer object
Recency int64
MntWines int64
MntFruits int64
MntMeatProducts int64
MntFishProducts int64
MntSweetProducts int64
MntGoldProds int64
NumDealsPurchases int64
NumWebPurchases int64
NumCatalogPurchases int64
NumStorePurchases int64
NumWebVisitsMonth int64
AcceptedCmp3 int64
AcceptedCmp4 int64
AcceptedCmp5 int64
AcceptedCmp1 int64
AcceptedCmp2 int64
Complain int64
Response int64
dtype: object
Missing Values:
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Response 0
dtype: int64
Number of duplicated rows: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year_Birth 2240 non-null int64
1 Education 2240 non-null object
2 Marital_Status 2240 non-null object
3 Income 2216 non-null float64
4 Kidhome 2240 non-null int64
5 Teenhome 2240 non-null int64
6 Dt_Customer 2240 non-null object
7 Recency 2240 non-null int64
8 MntWines 2240 non-null int64
9 MntFruits 2240 non-null int64
10 MntMeatProducts 2240 non-null int64
11 MntFishProducts 2240 non-null int64
12 MntSweetProducts 2240 non-null int64
13 MntGoldProds 2240 non-null int64
14 NumDealsPurchases 2240 non-null int64
15 NumWebPurchases 2240 non-null int64
16 NumCatalogPurchases 2240 non-null int64
17 NumStorePurchases 2240 non-null int64
18 NumWebVisitsMonth 2240 non-null int64
19 AcceptedCmp3 2240 non-null int64
20 AcceptedCmp4 2240 non-null int64
21 AcceptedCmp5 2240 non-null int64
22 AcceptedCmp1 2240 non-null int64
23 AcceptedCmp2 2240 non-null int64
24 Complain 2240 non-null int64
25 Response 2240 non-null int64
dtypes: float64(1), int64(22), object(3)
memory usage: 455.1+ KB
None
Detailed Observations and Insights¶
1. Dataset Overview¶
- The dataset contains 2,240 rows and 27 columns, focusing on customer behavior and engagement.
2. Column-Wise Observations¶
- ID: Unique identifier for customers.
- Year_Birth: Use to calculate age; validate for unrealistic values.
- Education & Marital_Status: Useful for demographic analysis.
- Income: Key variable; has 24 missing values to handle.
- Kidhome & Teenhome: Represent children counts; combine into TotalChildren.
- Dt_Customer: Convert to datetime to analyze customer tenure.
- Recency: Measures customer engagement recency.
- Monetary Columns (Mnt*): Analyze spending patterns and calculate TotalSpending.
- Purchase Channels (Num*Purchases): Indicate preferred shopping channels.
- Campaign Responses: Assess campaign effectiveness and customer responsiveness.
- Complain: Binary indicator of dissatisfaction.
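The transformations proposed above (TotalChildren, TotalSpending, datetime parsing) can be sketched as follows. This is an illustrative sketch on a tiny hand-made frame, not the actual dataset; the derived column names are the ones suggested in the observations:

```python
import pandas as pd

# Illustrative sketch of the derived features proposed above,
# using a tiny hand-made frame rather than the real dataset.
df = pd.DataFrame({
    "Kidhome": [0, 1], "Teenhome": [1, 1],
    "MntWines": [635, 11], "MntFruits": [88, 1],
    "MntMeatProducts": [546, 6], "MntFishProducts": [172, 2],
    "MntSweetProducts": [88, 1], "MntGoldProds": [88, 6],
    "Dt_Customer": ["04-09-2012", "08-03-2014"],  # day-first format, as in the data
})

# Combine children counts into a single TotalChildren feature
df["TotalChildren"] = df["Kidhome"] + df["Teenhome"]

# Sum the Mnt* columns into TotalSpending
mnt_cols = [c for c in df.columns if c.startswith("Mnt")]
df["TotalSpending"] = df[mnt_cols].sum(axis=1)

# Parse the enrollment date (day-first) so customer tenure can be computed
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")
print(df[["TotalChildren", "TotalSpending", "Dt_Customer"]])
```

Parsing Dt_Customer with an explicit day-first format matters here: dates such as 04-09-2012 would otherwise be silently read as April 9 rather than September 4.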
3. Missing Values¶
- Income: Contains 24 missing values (1.07%).
4. Duplicates¶
- No duplicated rows were found.
5. Data Quality Issues¶
- Handle missing values in Income.
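One common way to handle the 24 missing Income values is median imputation, sketched below on toy data. Median (rather than mean) is a reasonable choice given the right skew noted later in the analysis; other strategies, such as group-wise imputation by Education, are equally valid:

```python
import pandas as pd

# Toy example of median imputation for missing Income values.
# Median is preferred over mean here because Income is right-skewed.
income = pd.Series([58138.0, 46344.0, None, 71613.0, None, 26646.0])

median_income = income.median()  # computed ignoring NaNs
income_filled = income.fillna(median_income)

print(f"Imputed {income.isnull().sum()} missing values with median {median_income}")
```

On the real dataset the same two lines apply to data['Income'], filling all 24 gaps without dropping rows.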
Based on the description of the dataset and the context of the marketing campaign analysis, the ID attribute can likely be dropped, as it does not directly contribute to customer segmentation for marketing campaigns:
ID:
- This is a unique identifier and will not provide any useful information for segmentation.
Additionally, we may want to consider aggregating or transforming certain features, such as:
Recency:
- This is useful for understanding customer engagement.
Income, NumDealsPurchases, NumCatalogPurchases, NumStorePurchases, NumWebPurchases, AcceptedCmp1-5:
- These can be useful for understanding customer behavior and interactions with different sales channels.
6. Business Insights and Next Steps¶
- Customer Segmentation: Group customers by demographics, spending, and engagement.
- High-Value Customers: Identify using Income and TotalSpending.
- Campaign Analysis: Analyze effectiveness of AcceptedCmp* campaigns.
- Retention Strategies: Target high-spending customers with low Recency.
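The segmentation step itself typically scales the features and then runs a clustering algorithm. A minimal sketch with synthetic data, assuming KMeans as one candidate among the algorithms imported above (the two synthetic features stand in for variables like Income and TotalSpending):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Minimal sketch: scale features, then cluster. Synthetic two-feature data
# standing in for real customer features (e.g. Income, TotalSpending).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[30000, 100], scale=[5000, 30], size=(50, 2)),    # low spenders
    rng.normal(loc=[80000, 1200], scale=[8000, 200], size=(50, 2)),  # high spenders
])

# Scaling is essential: Income is orders of magnitude larger than spending,
# and KMeans distances would otherwise be dominated by it.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# With well-separated groups, the two synthetic segments are recovered
print(np.bincount(labels))
```

In practice the number of clusters would be chosen with diagnostics such as the silhouette score (imported above), rather than fixed at 2.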
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What are the summary statistics of the data? Explore summary statistics for the numerical and categorical variables.
- Find the number of unique observations in each category of the categorical columns. Write your findings/observations/insights.
- Are all categories distinct from each other, or can some categories be combined? Is 2n Cycle different from Master?
- There are 8 categories in Marital_Status, with some categories having counts of less than 5. Can we combine these categories with other categories?
# Generate summary statistics for the dataset
summary_statistics = data.describe(include='all').T
# Summary statistics for numerical variables
numerical_summary = data.describe().T
print("Summary statistics for numerical variables:\n", numerical_summary)
# Summary statistics for categorical variables
categorical_summary = data.describe(include=['object']).T
print("\nSummary statistics for categorical variables:\n", categorical_summary)
# Number of unique observations in each category of categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns
for col in categorical_cols:
print(f"Unique values in '{col}':\n{data[col].value_counts()}\n")
# Analyze 'Marital_Status' categories and potential combinations.
marital_status_counts = data['Marital_Status'].value_counts()
print(marital_status_counts)
# Identify categories with counts less than 5
low_count_categories = marital_status_counts[marital_status_counts < 5].index
# Combine low-count categories into a new category, e.g., "Other"
data['Marital_Status'] = data['Marital_Status'].replace(low_count_categories, 'Other')
# Verify the changes
print(data['Marital_Status'].value_counts())
#Is 2n Cycle different from Master?
# Group the data by 'Education' and calculate relevant metrics (mean income, average spending, and response rate)
education_analysis = data.groupby('Education').agg({
'Income': 'mean', # Average income
'MntWines': 'mean', # Average amount spent on wine products
'MntFishProducts': 'mean', # Average amount spent on fish products
'MntMeatProducts': 'mean', # Average amount spent on meat products
'MntFruits': 'mean', # Average amount spent on fruits
'MntSweetProducts': 'mean', # Average amount spent on sweet products
'MntGoldProds': 'mean', # Average amount spent on gold products
'Response': 'mean' # Average response rate for campaigns
}).reset_index()
print("Average metrics by education level:\n", education_analysis)
# Combine '2n Cycle' and 'Master' into a single category
data['Education'] = data['Education'].replace({'2n Cycle': 'Master or Equivalent', 'Master': 'Master or Equivalent'})
Summary statistics for numerical variables:
count mean std min 25% \
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00
Kidhome 2240.0 0.444196 0.538398 0.0 0.00
Teenhome 2240.0 0.506250 0.544538 0.0 0.00
Recency 2240.0 49.109375 28.962453 0.0 24.00
MntWines 2240.0 303.935714 336.597393 0.0 23.75
MntFruits 2240.0 26.302232 39.773434 0.0 1.00
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00
Complain 2240.0 0.009375 0.096391 0.0 0.00
Response 2240.0 0.149107 0.356274 0.0 0.00
50% 75% max
Year_Birth 1970.0 1977.00 1996.0
Income 51381.5 68522.00 666666.0
Kidhome 0.0 1.00 2.0
Teenhome 0.0 1.00 2.0
Recency 49.0 74.00 99.0
MntWines 173.5 504.25 1493.0
MntFruits 8.0 33.00 199.0
MntMeatProducts 67.0 232.00 1725.0
MntFishProducts 12.0 50.00 259.0
MntSweetProducts 8.0 33.00 263.0
MntGoldProds 24.0 56.00 362.0
NumDealsPurchases 2.0 3.00 15.0
NumWebPurchases 4.0 6.00 27.0
NumCatalogPurchases 2.0 4.00 28.0
NumStorePurchases 5.0 8.00 13.0
NumWebVisitsMonth 6.0 7.00 20.0
AcceptedCmp3 0.0 0.00 1.0
AcceptedCmp4 0.0 0.00 1.0
AcceptedCmp5 0.0 0.00 1.0
AcceptedCmp1 0.0 0.00 1.0
AcceptedCmp2 0.0 0.00 1.0
Complain 0.0 0.00 1.0
Response 0.0 0.00 1.0
Summary statistics for categorical variables:
count unique top freq
Education 2240 5 Graduation 1127
Marital_Status 2240 8 Married 864
Dt_Customer 2240 663 31-08-2012 12
Unique values in 'Education':
Education
Graduation 1127
PhD 486
Master 370
2n Cycle 203
Basic 54
Name: count, dtype: int64
Unique values in 'Marital_Status':
Marital_Status
Married 864
Together 580
Single 480
Divorced 232
Widow 77
Alone 3
Absurd 2
YOLO 2
Name: count, dtype: int64
Unique values in 'Dt_Customer':
Dt_Customer
31-08-2012 12
12-09-2012 11
14-02-2013 11
12-05-2014 11
20-08-2013 10
..
12-06-2014 1
30-11-2013 1
09-03-2013 1
27-03-2014 1
13-03-2014 1
Name: count, Length: 663, dtype: int64
Marital_Status
Married 864
Together 580
Single 480
Divorced 232
Widow 77
Alone 3
Absurd 2
YOLO 2
Name: count, dtype: int64
Marital_Status
Married 864
Together 580
Single 480
Divorced 232
Widow 77
Other 7
Name: count, dtype: int64
Observations and Insights¶
Numerical Variables:¶
Income:
- The Income variable has a mean of approximately 52,247 USD, with a minimum of 1,730 USD and a maximum of 666,666 USD. This wide range suggests a diverse customer base that includes both low- and high-income customers.
- The standard deviation is quite high (25,173 USD), indicating significant variation in income levels among customers.
Spending on Products:
- Spending on wine products (mean: ~303.94 USD) is the highest among the product categories, followed by meat products (mean: ~166.95 USD).
- Fruits and sweet products have lower average spending (mean: ~26.30 USD and ~27.06 USD, respectively), suggesting that customers spend less on these categories.
- Fish products (mean: ~37.53 USD) and gold products (mean: ~44.02 USD) show moderate average spending.
Recency:
- The mean value for Recency (number of days since last purchase) is 49.1, with a minimum of 0 days and a maximum of 99 days, indicating that time since last purchase is spread fairly evenly across the customer base.
Web Visits and Purchases:
- The NumWebVisitsMonth (number of website visits in the last month) has a mean of 5.32, indicating moderate engagement with the company’s website.
- NumDealsPurchases, NumCatalogPurchases, NumStorePurchases, and NumWebPurchases are all relatively low, with average values ranging from 2 to 6, which suggests that customers may not be highly frequent purchasers through all available channels.
Categorical Variables:¶
Education:
- The Education column shows that most customers have either a Graduation or Master degree, with Graduation being the most frequent category (1127 occurrences).
- The Basic education group appears to have the lowest frequency, which might reflect a different target group with distinct purchasing behavior.
Marital Status:
- The majority of customers are either Married or Living Together, with Married being the most frequent category (864 occurrences).
- The Single and Divorced groups have fewer customers, indicating potentially different behaviors or purchasing patterns.
Campaign Response:
- The response rate to marketing campaigns is low overall, with a mean of 0.149 (14.91% of customers responding to the last campaign).
- This suggests that marketing efforts may need to be optimized to increase customer engagement and conversion.
Key Insights:¶
- There is a large variation in customer income, with some customers earning significantly more than others, which may impact how they respond to different marketing campaigns.
- Spending patterns on various products (e.g., wine, gold) show that different product categories appeal to different customer segments, and targeted marketing could be beneficial.
- Most customers are either Married or Living Together, which may influence family-related product preferences.
- The low response rate to campaigns indicates that there may be room for improving customer engagement strategies.
Observations on Unique Value Counts in Categorical Columns¶
Education: The majority of customers hold a Graduation degree. There's a smaller proportion of customers with other educational levels like PhD and Master. We may consider combining the less frequent education levels into an 'Other' group for analysis, depending on the analysis goals.
Marital_Status: The most common marital status is Married. There are several other statuses with varying frequencies, some having very few observations. Consider grouping the less frequent statuses (e.g., 'Alone', 'YOLO', 'Absurd') into an 'Other' category, as they may not be sufficiently represented to yield meaningful insights.
Advantages of Combining "2n Cycle" and "Master" into One Category¶
Combining "2n Cycle" and "Master" into a single "Higher Education" category offers several advantages:
Simplification of Analysis:
- Reduces the complexity of segmentation by combining similar education levels into one group, making the analysis more straightforward.
Increased Statistical Power:
- Helps avoid data sparsity and ensures sufficient data for analysis, leading to more robust insights.
Better Segmentation for High-Earning or Engaged Customers:
- Both categories represent educated individuals with likely higher incomes, so combining them targets a group that may be more engaged with premium products.
Easier Targeting for Marketing Campaigns:
- Unified marketing strategy for educated individuals, improving the efficiency of personalized offers and outreach.
Focus on Larger Market Segments:
- Combines two categories to target a larger customer group with similar behaviors, increasing engagement and outreach.
Conclusion:¶
Combining "2n Cycle" and "Master" into "Higher Education" simplifies segmentation, improves statistical power, and streamlines marketing strategies, especially if both categories exhibit similar purchasing behaviors and engagement levels.
Univariate Analysis on Numerical and Categorical data¶
Univariate analysis is used to explore each variable in a dataset separately. It looks at the range of values as well as their central tendency, and can be done for both numerical and categorical variables.
- Plot histograms and box plots for the different numerical features to understand what the data looks like.
- Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Leading Questions:
- How does the distribution of Income variable vary across the dataset?
- The histogram and the box plot show some extreme values on the right side of the distribution of the Income feature. Can we treat them as outliers and remove them, or should we analyze these extreme values further?
- There are only a few rows with extreme values for the Income variable. Is that enough information to decide whether to treat them? At what percentile does the upper whisker lie?
# Define numerical and categorical columns
# These features are selected because they:
# 1. Are mentioned in the comments as key variables for analysis or potential areas of interest.
# 2. Provide information about customer demographics (e.g., Income, Education, Marital Status, Age),
# purchase behavior (e.g., spending on different product categories, number of purchases),
#    and engagement with marketing campaigns (e.g., campaign acceptance rates, recency of purchases).
# Numerical Features:
numerical_features = ['Income','Recency', 'Complain', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Response']
# Categorical Features:
categorical_features = ['Education', 'Marital_Status','Kidhome','Teenhome']
# Univariate Analysis: Numerical Variables
for col in numerical_features:
plt.figure(figsize=(14, 6))
# Histogram
plt.subplot(1, 2, 1)
sns.histplot(data[col], kde=True, color='blue')
plt.title(f'Histogram of {col}', fontsize=14)
plt.xlabel(col, fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# Box Plot
plt.subplot(1, 2, 2)
sns.boxplot(x=data[col], color='green')
plt.title(f'Box Plot of {col}', fontsize=14)
plt.xlabel(col, fontsize=12)
plt.show()
# Univariate Analysis: Categorical Variables
for col in categorical_features:
plt.figure(figsize=(10, 6))
sns.countplot(x=data[col], palette='viridis')
plt.title(f'Count Plot of {col}', fontsize=14)
plt.xlabel(col, fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.show()
Observations and Insights¶
Numerical Variables:¶
Income:
- The Income distribution shows a right skew, with a few customers having extremely high incomes, resulting in a wide range from $1,730 to $666,666.
- The box plot reveals some outliers in income, suggesting that a small subset of customers earns significantly higher than the majority.
MntWines:
- The MntWines histogram suggests that most customers have moderate spending on wine, with a few spending very high amounts, indicating potential outliers.
- The box plot shows that the spending distribution is not symmetric, with some customers spending much more than others.
Recency:
- The Recency distribution is fairly uniform between 0 and 99 days (mean ~49, median 49), indicating that time since last purchase is spread evenly across customers rather than concentrated at the low end.
- The box plot shows no extreme outliers, as Recency is bounded between 0 and 99 days.
Other Product Spending (MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds):
- These product categories generally show low spending, with moderate peaks in the histograms and some significant outliers.
- The box plots for these features also highlight the presence of a few extreme values in spending.
Campaign Acceptance:
- The AcceptedCmp1–AcceptedCmp5 columns show low acceptance rates across the different campaigns, with AcceptedCmp1 and AcceptedCmp2 having slightly higher rates.
- The response rate to campaigns is low, with only 14.91% of customers responding.
Categorical Variables:¶
Education:
- The Education variable shows a dominant presence of customers with a "Graduation" education level, followed by PhD and Master.
- The count plot for Education indicates a low count of customers with "Basic" education, suggesting that higher education is more common among the customers.
Marital Status:
- The Marital_Status variable shows that Married is the most frequent category, followed by Together and Single.
- There are also some low-count categories, such as Alone, Absurd, and YOLO, which were combined into the "Other" category to simplify the analysis.
Kidhome and Teenhome:
- Kidhome (number of small children) and Teenhome (number of teenagers) both have a low count of customers with more than one child in these age groups.
- These features show that most customers either have no children or one child in each category.
Insights:¶
- Income and spending on wine show skewed distributions with some high-income and high-spending customers, which might be crucial for identifying target segments.
- Recency indicates that a large portion of customers are actively engaging with the company, making purchases in recent months.
- The low response rates to marketing campaigns suggest that further analysis or adjustments in campaign targeting could be beneficial.
- The Education and Marital Status variables show clear trends, with the majority of customers in the Graduation category and a larger proportion being Married or Living Together.
# Observations and Insights of Extreme Values, percentile of the upper whisker on Income Distribution.
# Identifying potential outliers using the IQR method
Q1 = data['Income'].quantile(0.25) # 25th percentile
Q3 = data['Income'].quantile(0.75) # 75th percentile
IQR = Q3 - Q1 # Interquartile range
# Define the lower and upper bound for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter data for extreme values
extreme_values = data[data['Income'] > upper_bound]
# Displaying extreme values
print("Extreme Values in Income Feature:\n", extreme_values['Income'])
# Percentage of extreme values
extreme_value_percentage = (len(extreme_values) / len(data)) * 100
print(f"\nPercentage of Extreme Values: {extreme_value_percentage:.2f}%")
# Plotting data with extreme values highlighted
plt.figure(figsize=(10, 6))
sns.boxplot(y=data['Income'], color='green')  # vertical box so the horizontal bound line is meaningful
plt.axhline(y=upper_bound, color='red', linestyle='--', label='Upper Bound')
plt.title('Box Plot of Income with Upper Bound Highlighted', fontsize=14)
plt.ylabel('Income', fontsize=12)
plt.legend()
plt.show()
Extreme Values in Income Feature:
164     157243.0
617     162397.0
655     153924.0
687     160803.0
1300    157733.0
1653    157146.0
2132    156924.0
2233    666666.0
Name: Income, dtype: float64

Percentage of Extreme Values: 0.36%
Observations and Insights of Extreme Values and Upper Whisker Percentile¶
Extreme Values in Income:¶
The extreme values in the Income feature, identified using the IQR method, are:
- 157,243 USD
- 162,397 USD
- 153,924 USD
- 160,803 USD
- 157,733 USD
- 157,146 USD
- 156,924 USD
- 666,666 USD
These extreme values represent high-income individuals who are significantly above the rest of the customer base. The last value, 666,666 USD, appears to be a clear outlier, with income far exceeding the typical range.
Percentage of Extreme Values:¶
- The percentage of extreme values in the Income feature is 0.36% of the total dataset. This suggests that the extreme income values make up a very small proportion of the overall data.
Upper Whisker of the Income Distribution:¶
- The upper whisker in the box plot lies at 118,350.5 USD, marking the threshold for identifying outliers in the Income distribution.
- The extreme values exceed this upper whisker, confirming that they are outliers.
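The whisker figure quoted above can be recomputed directly: the upper whisker of a box plot sits at the largest observation that does not exceed Q3 + 1.5·IQR, not at the bound itself. A minimal sketch of that distinction, using illustrative income values rather than the project data:

```python
import pandas as pd

# Illustrative income values (not the project data)
income = pd.Series([20_000, 35_000, 48_000, 52_000, 61_000, 75_000, 90_000, 250_000])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr  # outlier threshold

# The upper whisker is the largest value at or below the threshold
upper_whisker = income[income <= upper_bound].max()
print(upper_bound, upper_whisker)  # the whisker is a real data point, the bound usually is not
```

On the project data, the same computation is what yields the 118,350.5 USD whisker reported above.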
Should These Extreme Values Be Treated as Outliers?¶
- Since only 0.36% of the dataset consists of extreme values, they represent a small portion of the overall dataset. These values could be genuine high-income customers and may reflect a valuable segment for marketing.
- Removing these values could lead to the loss of important customer segments, especially if the goal is to identify high-income customers for specific campaigns.
Conclusion:¶
- The extreme values for Income are genuine outliers and should likely be retained in the dataset as they represent a small segment of high-income customers.
- Further investigation could be done to determine whether these customers exhibit unique behaviors that warrant tailored marketing strategies.
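If a later model turns out to be sensitive to the 666,666 USD record, an alternative to dropping rows is capping (winsorizing) values at the IQR upper bound, which keeps the high-income customers in the dataset while limiting their leverage. A sketch on illustrative values, not the project data:

```python
import pandas as pd

# Illustrative incomes with one extreme record (not the project data)
income = pd.Series([30_000.0, 45_000.0, 55_000.0, 70_000.0, 666_666.0])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
upper_bound = q3 + 1.5 * (q3 - q1)

# Cap values above the bound instead of removing the rows,
# so no customer segment is lost from the analysis
income_capped = income.clip(upper=upper_bound)
print(income_capped.max())
```

Whether to cap or retain the raw values depends on the downstream algorithm; distance-based clustering such as K-Means is notably sensitive to extremes even after scaling.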
Bivariate Analysis¶
- Analyze different categorical and numerical variables and check how different variables are related to each other.
- Check the relationship of numerical variables with categorical variables.
# Bivariate Analysis: Relationship between numerical and categorical variables
# Analyzing relationships
for num_col in numerical_features:
    for cat_col in categorical_features:
        plt.figure(figsize=(10, 6))
        sns.boxplot(x=cat_col, y=num_col, data=data, palette='viridis')
        plt.title(f'Relationship between {cat_col} and {num_col}', fontsize=14)
        plt.xlabel(cat_col, fontsize=12)
        plt.ylabel(num_col, fontsize=12)
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()
# Correlation heatmap for numerical variables
plt.figure(figsize=(12, 8))
sns.heatmap(data[numerical_features].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap for Numerical Variables', fontsize=14)
plt.show()
Observations and Insights¶
Bivariate Analysis: Relationship between Numerical and Categorical Variables:¶
Education vs Spending on Products:
- Customers with higher education levels such as PhD and Master tend to spend more on wine, sweet products, and gold. The box plots reveal that customers with lower education levels (like Basic) have comparatively lower spending.
- Graduation customers have varied spending patterns but generally spend more than Basic customers and less than those with PhD or Master degrees.
Marital Status vs Spending on Products:
- Customers who are Married or Living Together tend to spend more on wine and gold products, while Single and Divorced customers have a wider range of spending on products like meat and fruits.
- Marital status appears to influence spending on both luxury items (wine, gold) and essential products (fruits, meat), with married couples having a stronger preference for premium products.
Kidhome and Teenhome vs Spending:
- Families with children (Kidhome and Teenhome) tend to spend more on fruits and sweet products. This suggests that families with children may prioritize healthier or family-friendly products.
- The number of teenagers (Teenhome) in a household appears to have an impact on the spending on meat products, possibly indicating more varied tastes in younger families.
Recency vs Campaign Response:
- Customers who made a purchase recently are generally more likely to respond positively to marketing campaigns, as indicated by the box plot. This suggests that recency plays a role in customer engagement.
Correlation Heatmap for Numerical Variables:¶
High Positive Correlations:
- MntWines and MntGoldProds: A strong positive correlation (0.55) indicates that customers who spend on wine also tend to spend on gold products.
- NumWebVisitsMonth and NumWebPurchases: A positive correlation (0.62) suggests that customers who visit the website more frequently are also more likely to make purchases through the website.
Moderate Positive Correlations:
- MntMeatProducts and MntFishProducts: Moderate positive correlation (0.51), implying that customers who spend on meat products tend to also spend on fish products.
- Recency and Response: A moderate positive correlation (0.35) indicates that customers who make purchases more recently are more likely to respond to campaigns.
Low or No Significant Correlations:
- NumDealsPurchases shows very weak correlations with other numerical variables, suggesting that purchasing with discounts does not have a strong relationship with product-specific spending.
- Complain and Response also have a weak correlation, suggesting that customer complaints do not necessarily lead to higher response rates to marketing campaigns.
Conclusion:¶
- Customer segmentation based on education and marital status could reveal key insights into spending behaviors. Families with children tend to focus on different product categories, with a preference for essentials like fruits and meat.
- The correlation heatmap indicates that spending on luxury items like wine and gold is often correlated, and recent purchases are strongly linked with positive campaign responses, making recency an important feature in customer engagement strategies.
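Reading individual cells off a large annotated heatmap is error-prone, so the strongest pairs can also be extracted programmatically from the correlation matrix. A sketch on a small synthetic frame (the column names mirror the dataset but the values are generated for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
wine = rng.normal(300, 60, 200)
df = pd.DataFrame({
    'MntWines': wine,
    'MntGoldProds': wine * 0.2 + rng.normal(0, 5, 200),  # built to correlate with wine
    'Recency': rng.uniform(0, 100, 200),                 # independent of both
})

corr = df.corr()
# Keep only the upper triangle so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs.head(3))
```

Applied to `data[numerical_features]`, the same few lines rank every feature pair by absolute correlation, making redundancy checks before clustering much quicker than visual inspection.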
Feature Engineering and Data Processing¶
In this section, we will first prepare our dataset for analysis.
- Imputing missing values
Think About It:
- Can we extract the age of each customer and create a new feature?
- Can we find the total kids and teens in the home?
- Can we find out how many members each family has?
- Can we find the total amount spent by the customers on various products?
- Can we find out how long the customer has been with the company?
- Can we find out how many offers the customers have accepted?
- Can we find out amount spent per purchase?
# Impute missing values in the 'Income' column using the median of the column
data['Income'] = data['Income'].fillna(data['Income'].median())  # assignment instead of chained inplace=True, which is deprecated in recent pandas
# Calculate the age of each customer
current_year = datetime.now().year
data['Age'] = current_year - data['Year_Birth']
# Calculate total kids and teens in the home
data['Total_Kids'] = data['Kidhome'] + data['Teenhome']
# Calculate number of family members (including the customer)
data['Family_Members'] = data['Kidhome'] + data['Teenhome'] + 1
# Calculate total amount spent on all products
data['Total_Spent'] = (data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] +
data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds'])
# Calculate customer lifetime value (assuming 'Dt_Customer' is a datetime column)
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format='%d-%m-%Y') # Converting date to datetime format
data['Years_With_Company'] = (datetime.now() - data['Dt_Customer']).dt.days / 365
# Find out how many offers the customers have accepted
data['Total_Offers_Accepted'] = data[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
'AcceptedCmp4', 'AcceptedCmp5']].sum(axis=1)
# Calculate amount spent per accepted offer (stored as Amount_Spent_Per_Purchase)
data['Amount_Spent_Per_Purchase'] = data['Total_Spent'] / data['Total_Offers_Accepted'].replace(0, 1)  # map 0 to 1 to avoid division by zero
# Display the new features
data[['Age', 'Total_Kids', 'Family_Members', 'Total_Spent', 'Years_With_Company',
'Total_Offers_Accepted', 'Amount_Spent_Per_Purchase']].head()
| Age | Total_Kids | Family_Members | Total_Spent | Years_With_Company | Total_Offers_Accepted | Amount_Spent_Per_Purchase | |
|---|---|---|---|---|---|---|---|
| 0 | 67 | 0 | 1 | 1617 | 12.268493 | 0 | 1617.0 |
| 1 | 70 | 2 | 3 | 27 | 10.761644 | 0 | 27.0 |
| 2 | 59 | 0 | 1 | 776 | 11.306849 | 0 | 776.0 |
| 3 | 40 | 1 | 2 | 53 | 10.832877 | 0 | 53.0 |
| 4 | 43 | 1 | 2 | 422 | 10.893151 | 0 | 422.0 |
Observations and Insights on Feature Engineering and Data Processing¶
1. Age:¶
- The Age of customers varies; among the previewed rows, ages range from 40 to 70, and the full dataset may span a wider range.
- The Age distribution suggests that the customer base consists of both younger and older individuals, which may indicate different buying patterns based on life stage (e.g., younger customers might prioritize products for kids or family, while older customers might focus on more personal or luxury items).
2. Total Kids:¶
- The Total_Kids feature shows that many customers have small children or teenagers in the household; in the previewed rows the count ranges from 0 to 2.
- This indicates that there is a significant proportion of families in the dataset, which might impact their spending behavior, such as higher spending on food or children’s products.
3. Family Members:¶
- The Family_Members feature reveals that most customers have either 1 or 2 family members (including themselves), indicating a relatively small household size on average.
- The presence of family members can influence spending habits, particularly with regard to products such as food, kids' items, and other family-oriented goods.
4. Total Spent:¶
- The Total_Spent feature shows a wide range of spending behavior, with some customers spending a significant amount (e.g., customer 0 spending 1617 USD) and others spending very little (e.g., customer 1 spending 27 USD).
- The distribution of spending can highlight high-value customers, which may be critical for targeting specific campaigns or offering promotions to drive increased spending.
5. Years With Company:¶
- The Years_With_Company feature shows how long each customer has been with the company; in the previewed rows it ranges from roughly 10 to 12 years.
- Customers who have been with the company longer might be more loyal, and this could provide insight into which customers are more likely to respond to future campaigns or promotions.
6. Total Offers Accepted:¶
- The Total_Offers_Accepted feature shows that a significant portion of customers has accepted no offers (i.e., Total_Offers_Accepted = 0).
- This suggests that future marketing efforts may need to focus on re-engaging these customers or tailoring offers that are more appealing to this group. The lack of accepted offers can be an opportunity for improving customer engagement strategies.
7. Amount Spent Per Purchase:¶
- The Amount_Spent_Per_Purchase feature shows the average amount spent per accepted offer. Many customers have a high spending per purchase (e.g., 1617 USD per offer for customer 0).
- This could indicate that customers who engage with the company's offers tend to spend significantly, suggesting that these high-value customers are worth targeting with personalized offers and higher-value campaigns.
Conclusion:¶
- The customer demographics (age, family size, and number of kids) and spending behavior (total spent and amount spent per purchase) provide valuable insights for customer segmentation.
- Re-engagement strategies could be vital for customers with no accepted offers, and loyal customers with long tenure could be targeted for exclusive offers or reward programs.
- Personalized marketing campaigns based on family size and spending patterns could help optimize return on investment (ROI) for future marketing efforts.
Data Preparation for Segmentation¶
- The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
- Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
- Plot the correlation plot after we've removed the irrelevant variables
- Scale the Data
# Step 1: Select segmentation attributes (behavioral data)
segmentation_attributes = [
'Income',
'Age',
'Total_Kids',
'Family_Members',
'Total_Spent',
'Years_With_Company',
'Total_Offers_Accepted',
'Amount_Spent_Per_Purchase',
'MntWines',
'MntFruits',
'MntMeatProducts',
'MntFishProducts',
'MntSweetProducts',
'MntGoldProds',
'Recency',
'NumDealsPurchases',
'NumCatalogPurchases',
'NumStorePurchases',
'NumWebPurchases',
'NumWebVisitsMonth'
]
data_for_clustering = data[segmentation_attributes]
# Plot the correlation plot
plt.figure(figsize=(10, 8))
sns.heatmap(data_for_clustering.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap for Segmentation Attributes', fontsize=14)
plt.show()
# Scale the data using StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_for_clustering)
Observations and Insights - Data Preparation for Segmentation¶
1. Correlation Heatmap for Segmentation Attributes:¶
- The correlation heatmap was plotted for the selected segmentation attributes (behavioral data).
- Key Observations:
- The heatmap shows how strongly different segmentation features (such as Income, Age, Total_Spent, etc.) are correlated with each other.
- Highly correlated features (e.g., Total_Spent and MntWines, MntFruits) suggest redundancy, and some features may be considered for removal or handling to avoid multicollinearity during clustering or PCA.
- Understanding these correlations can help refine the feature set before applying dimensionality reduction or clustering algorithms.
2. Data Scaling:¶
- The data has been scaled using StandardScaler, ensuring that each feature has a mean of 0 and a standard deviation of 1.
- Why this is important:
- Scaling standardizes the features, ensuring that each one contributes equally to the analysis, especially when using algorithms like PCA or K-Means that are sensitive to the magnitude of features.
- Without scaling, features with larger ranges (e.g., Income) might dominate the clustering or PCA results.
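The effect of scaling can be verified directly: after the transform, every column should have mean ≈ 0 and standard deviation ≈ 1. The computation StandardScaler performs (with default settings) is a plain z-score, sketched here with NumPy on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales: an income-like column and a count-like column
X = np.column_stack([rng.normal(50_000, 20_000, 500),
                     rng.integers(0, 5, 500)])

# z-scoring: equivalent to StandardScaler().fit_transform(X) with defaults
# (note StandardScaler uses the population std, i.e. ddof=0, matching np.std)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Without this step, the income column's variance would dominate every Euclidean distance computed by K-Means or PCA.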
Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶
Applying T-SNE¶
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=20, random_state=42)
tsne_results = tsne.fit_transform(scaled_data)
# Convert t-SNE results to DataFrame
tsne_df = pd.DataFrame(tsne_results, columns=['Dimension 1', 'Dimension 2'])
# Plot t-SNE results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Dimension 1', y='Dimension 2', data=tsne_df, palette='viridis', alpha=0.7)
plt.title('T-SNE Visualization of Clustering Data', fontsize=14)
plt.xlabel('Dimension 1', fontsize=12)
plt.ylabel('Dimension 2', fontsize=12)
plt.show()
1. T-SNE Visualization of Clustering Data:¶
- The t-SNE scatter plot provides a 2D representation of the high-dimensional data after dimensionality reduction.
- Key Observations:
- The plot shows that the data is spread across several distinct regions in the 2D space, which may indicate potential clusters or groupings in the data.
- However, the data points are dispersed across a wide range, suggesting that further clustering may be required to identify more defined groups.
- The separation between the clusters might not be very clear visually, which could imply that more advanced clustering techniques or additional preprocessing might be needed to better identify meaningful segments.
2. General Insights:¶
- The t-SNE method has been effective in reducing the dimensionality of the data, allowing us to visualize potential groupings or patterns.
- Further clustering (e.g., K-Means or DBSCAN) on the 2D t-SNE results can help reveal more concrete clusters in the data.
- The current spread and distribution of the data suggest that while dimensionality reduction has simplified the complexity, additional techniques might be needed to improve the cluster separation.
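Because t-SNE output is highly sensitive to the perplexity setting, a common safeguard before trusting the picture is to re-run the embedding at a few perplexity values and check that the apparent structure persists. A minimal sketch on synthetic blobs (the values 5 and 20 are illustrative choices, not tuned for the project data):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Two well-separated synthetic "customer" groups in 5 dimensions
X = np.vstack([rng.normal(0, 1, (40, 5)),
               rng.normal(6, 1, (40, 5))])

for perplexity in (5, 20):
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=42).fit_transform(X)
    print(perplexity, emb.shape)  # one 2-D embedding per perplexity value
```

Structure that appears only at one perplexity value is more likely an artifact of the embedding than a genuine customer segment.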
Applying PCA¶
Think about it:
- Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?
Applying PCA Before Clustering: Benefits and Insights¶
1. Dimensionality Reduction:¶
- PCA reduces the number of variables while retaining most of the variance in the data.
- This improves the performance of clustering algorithms, particularly for high-dimensional data.
- Clustering in high dimensions can be computationally expensive and may lead to less distinct clusters due to the "curse of dimensionality."
2. Noise Reduction:¶
- PCA helps remove noise and irrelevant information from the data.
- This leads to more distinct and meaningful clusters.
- Clustering algorithms are sensitive to noise, and irrelevant features can hinder accurate grouping of data points.
3. Visualization:¶
- PCA reduces data to two or three dimensions, enabling cluster visualization.
- This aids in understanding the data structure and interpreting the clustering algorithm's results.
Summary:¶
- Applying PCA before clustering enhances the performance, interpretability, and efficiency of the clustering process, particularly with high-dimensional data.
- It is a valuable preprocessing step to consider before clustering.
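Rather than fixing the number of components at two, the choice can be driven by the cumulative explained variance. A sketch on synthetic data with deliberately redundant features (the 90% threshold is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic data: 4 independent features plus 6 noisy linear combinations of them
base = rng.normal(size=(300, 4))
X = np.hstack([base, base @ rng.normal(size=(4, 6)) + 0.1 * rng.normal(size=(300, 6))])

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold
n_components = int(np.searchsorted(cum_var, 0.90) + 1)
print(n_components, cum_var[n_components - 1])
```

For visualization, two components are the natural choice regardless; for feeding clusters to K-Means, a variance-based cut like this usually retains more of the structure.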
# Apply PCA to the data to visualize the data distributed in 2 dimensions
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data) # Fit and transform the data
# Create a DataFrame for PCA results
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2']) # Use pca_result
# Plot the PCA results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', data=pca_df, palette='viridis', alpha=0.7)
plt.title('PCA Visualization of Clustering Data', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.show()
# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio for Each Principal Component:\n", explained_variance_ratio)
print("\nTotal Variance Explained by the Two Components:", explained_variance_ratio.sum())
Explained Variance Ratio for Each Principal Component:
[0.40915664 0.10934491]

Total Variance Explained by the Two Components: 0.5185015469139312
Observations and Insights¶
1. PCA Visualization of Clustering Data:¶
- The PCA scatter plot shows how the data is distributed along the first two principal components (PC1 and PC2).
- Key Observations:
- The data points are spread across a wide range, indicating some level of variance along the two components.
- While there are some visible patterns in the scatter plot, the clusters are not very distinct, suggesting that further clustering might be needed to better define customer segments.
2. Explained Variance Ratio:¶
- Principal Component 1 (PC1) explains 40.92% of the total variance.
- Principal Component 2 (PC2) explains 10.93% of the total variance.
- Together, these two components explain 51.85% of the total variance in the data.
- Interpretation:
- The first two components capture a moderate portion of the variance, but there is still a significant amount of variance (about 48%) that is not captured by these components.
- Depending on the task, it might be necessary to consider more components to capture a higher percentage of the variance.
3. General Insights:¶
- The PCA has been useful in reducing the data's dimensionality to two components while retaining a meaningful portion of the variance.
- Further clustering or analysis can be conducted using the two components, and additional components can be considered if higher variance capture is necessary.
- Clustering algorithms (e.g., K-Means) can be applied to this reduced dataset for more defined groupings of customers.
K-Means¶
Think About It:
- How do we determine the optimal K value from the elbow curve?
- Which metric can be used to determine the final K value?
# Define the range of K values to test
k_range = range(1, 11)
# Elbow Method: Calculate WCSS (within-cluster sum of squares) for different K values
wcss = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)  # Use the scaled data for clustering
    wcss.append(kmeans.inertia_)  # Inertia is the WCSS value
# Plot the Elbow Method
plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss, marker='o', color='b')
plt.title('Elbow Method for Optimal K', fontsize=14)
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('WCSS (Within-Cluster Sum of Squares)', fontsize=12)
plt.show()
# Silhouette Score: Calculate silhouette scores for different K values
sil_scores = []
for k in k_range[1:]:  # Start from 2 because the silhouette score is undefined for 1 cluster
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    sil_score = silhouette_score(scaled_data, kmeans.labels_)
    sil_scores.append(sil_score)
# Plot Silhouette Scores for different K values
plt.figure(figsize=(10, 6))
plt.plot(k_range[1:], sil_scores, marker='o', color='g')
plt.title('Silhouette Score for Different K Values', fontsize=14)
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Silhouette Score', fontsize=12)
plt.show()
# Determine the optimal K by combining the results:
# look for the "elbow" in the WCSS curve and the highest silhouette score
optimal_k_elbow = 3  # elbow read from the WCSS plot above, where the curve visibly flattens
optimal_k_silhouette = k_range[1:][sil_scores.index(max(sil_scores))]
print(f"Optimal K based on the Elbow Method: {optimal_k_elbow}")
print(f"Optimal K based on the Silhouette Score: {optimal_k_silhouette}")
Optimal K based on the Elbow Method: 3
Optimal K based on the Silhouette Score: 2
Applying KMeans on the PCA data and visualize the clusters¶
# Using optimal_k_elbow (i.e., 3) as the final K
# Apply KMeans clustering on PCA data (PC1, PC2)
kmeans = KMeans(n_clusters=optimal_k_elbow, random_state=42)  # K=3 from the elbow method above
kmeans.fit(pca_df) # Fit KMeans on PCA results
# Get the cluster labels
pca_df['Cluster'] = kmeans.labels_
# Plot the clusters
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', data=pca_df, hue='Cluster', palette='viridis', alpha=0.7, s=100)
plt.title(f'KMeans Clustering on PCA Data (K={optimal_k_elbow})', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Cluster')
plt.show()
# Show the centroids
centroids = kmeans.cluster_centers_
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', data=pca_df, hue='Cluster', palette='viridis', alpha=0.7, s=100)
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X', label='Centroids')
plt.title(f'KMeans Clustering with Centroids (K={optimal_k_elbow})', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend()
plt.show()
Cluster Profiling¶
# Add cluster labels to the original dataframe
data['Cluster'] = kmeans.labels_
# Profiling the clusters with added categorical attributes
cluster_profiles = data.groupby('Cluster').agg({
'Income': ['mean', 'median'],
'Age': ['mean', 'median'],
'Total_Kids': 'mean',
'Family_Members': 'mean',
'Total_Spent': ['mean', 'median'],
'Years_With_Company': 'mean',
'Total_Offers_Accepted': 'mean',
'Amount_Spent_Per_Purchase': 'mean',
'Recency': 'mean',
'MntWines': 'mean',
'MntFruits': 'mean',
'MntMeatProducts': 'mean',
'MntFishProducts': 'mean',
'MntSweetProducts': 'mean',
'MntGoldProds': 'mean',
'Marital_Status': lambda x: x.mode()[0], # Most common value (mode)
'Education': lambda x: x.mode()[0] # Most common value (mode)
}).reset_index()
# Flatten MultiIndex columns
cluster_profiles.columns = ['Cluster'] + ['_'.join(col).strip() for col in cluster_profiles.columns[1:]]
cluster_profiles
| Cluster | Income_mean | Income_median | Age_mean | Age_median | Total_Kids_mean | Family_Members_mean | Total_Spent_mean | Total_Spent_median | Years_With_Company_mean | ... | Amount_Spent_Per_Purchase_mean | Recency_mean | MntWines_mean | MntFruits_mean | MntMeatProducts_mean | MntFishProducts_mean | MntSweetProducts_mean | MntGoldProds_mean | Marital_Status_<lambda> | Education_<lambda> | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 35781.146399 | 35322.0 | 52.839102 | 52.0 | 1.207671 | 2.207671 | 103.697848 | 66.0 | 11.312729 | ... | 103.530870 | 48.978485 | 46.335828 | 5.226380 | 23.845650 | 7.527596 | 5.289055 | 15.473340 | Married | Graduation |
| 1 | 1 | 76454.890909 | 75774.0 | 55.975207 | 55.0 | 0.211570 | 1.211570 | 1382.519008 | 1365.0 | 11.430712 | ... | 1207.533333 | 49.461157 | 622.920661 | 67.504132 | 448.274380 | 97.871074 | 68.591736 | 77.357025 | Married | Graduation |
| 2 | 2 | 57434.294170 | 57945.5 | 58.807420 | 59.0 | 1.254417 | 2.254417 | 723.871025 | 661.5 | 11.614212 | ... | 697.866313 | 48.980565 | 449.498233 | 22.067138 | 136.521201 | 29.678445 | 23.796820 | 62.309187 | Married | Graduation |
3 rows × 21 columns
Cluster Profiling Observations and Insights¶
Cluster 0:¶
- Demographics:
- Predominantly Married customers with a Graduation education level.
- Family size averaging 2.21 members, with 1.21 kids/teens on average.
- Income and Spending:
- Lowest average income ($35,781) among the clusters.
- Minimal spending ($104 on average) across all product categories, with:
- Wine: $46
- Meat Products: $24
- Very low engagement in campaigns, with an average of 0.08 offers accepted.
- Customer Relationship:
- Shortest tenure with the company (11.31 years).
- Average recency of about 49 days, similar to the other clusters.
Cluster 1:¶
- Demographics:
- Predominantly Married customers with a Graduation education level.
- Smallest family size (average of 1.21 members) and fewest kids (0.21 on average).
- Income and Spending:
- Highest average income ($76,455) among all clusters.
- Highest total spending ($1,383 on average), especially on:
- Wine: $623
- Meat Products: $448
- Highest engagement in campaigns, with an average of 0.69 offers accepted.
- Customer Relationship:
- Moderate tenure with the company (11.43 years).
- Average recency of about 49 days.
Cluster 2:¶
- Demographics:
- Primarily Married customers with a Graduation education level.
- Largest family size (average of 2.25 members) with 1.25 kids/teens on average.
- Income and Spending:
- Moderate income ($57,434).
- Moderate total spending ($724), focused on:
- Wine: $449
- Meat Products: $137
- Lower engagement in campaigns, with an average of 0.27 offers accepted.
- Customer Relationship:
- Longest tenure with the company (11.61 years).
- Recency similar to the other clusters (about 49 days on average).
Insights:¶
Cluster 1:
- Represents high-income, high-spending customers who engage more in campaigns.
- Focused on premium products (e.g., wine, meat).
- Strategy: Provide personalized premium campaigns to maintain loyalty.
Cluster 2:
- Moderate-income, family-oriented customers with balanced spending.
- Strategy: Target with family-focused campaigns and moderate-value products.
Cluster 0:
- Budget-conscious, low-spending customers with minimal campaign engagement.
- Strategy: Offer budget-friendly promotions and incentives to drive engagement.
Conclusion:¶
Including Marital_Status and Education enriched the cluster profiling, confirming that all clusters are predominantly Married with a Graduation education. However, spending, income, and family characteristics highlight clear segmentation opportunities for tailored marketing strategies.
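One profiling dimension not shown in the table above is cluster size: the share of customers in each segment determines how much revenue each strategy can actually reach. A sketch of that summary using hypothetical labels (not the project's actual cluster assignments):

```python
import pandas as pd

# Hypothetical cluster labels for 10 customers (not the project assignments)
labels = pd.Series([0, 0, 0, 1, 1, 2, 2, 2, 2, 0], name='Cluster')

# Absolute counts and relative shares per cluster
sizes = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()
summary = pd.DataFrame({'count': sizes, 'share': shares})
print(summary)
```

On the real data, `data['Cluster'].value_counts(normalize=True)` gives the same breakdown and could be joined onto `cluster_profiles`.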
Describe the characteristics of each cluster¶
Think About It:
- Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using?
- What is the next step to get more meaningful insights?
# Analyze cluster characteristics with a focus on purchasing behavior and channels
for cluster in range(optimal_k_elbow):
    print(f"\nCluster {cluster}:")
    cluster_data = data[data['Cluster'] == cluster]
    # Purchasing Behavior
    print("  Purchasing Behavior:")
    print(f"    Total Spent: {cluster_data['Total_Spent'].describe()}")
    print(f"    Amount Spent per Purchase: {cluster_data['Amount_Spent_Per_Purchase'].describe()}")
    print(f"    MntWines: {cluster_data['MntWines'].describe()}")  # Example product category
    print(f"    MntFruits: {cluster_data['MntFruits'].describe()}")
    print(f"    MntMeatProducts: {cluster_data['MntMeatProducts'].describe()}")
    print(f"    MntFishProducts: {cluster_data['MntFishProducts'].describe()}")
    print(f"    MntSweetProducts: {cluster_data['MntSweetProducts'].describe()}")
    print(f"    MntGoldProds: {cluster_data['MntGoldProds'].describe()}")
    # Channels Used
    print("  Channels Used:")
    print(f"    NumWebPurchases: {cluster_data['NumWebPurchases'].describe()}")
    print(f"    NumCatalogPurchases: {cluster_data['NumCatalogPurchases'].describe()}")
    print(f"    NumStorePurchases: {cluster_data['NumStorePurchases'].describe()}")
    print(f"    NumWebVisitsMonth: {cluster_data['NumWebVisitsMonth'].describe()}")
    print(f"    NumDealsPurchases: {cluster_data['NumDealsPurchases'].describe()}")
    # Visualizations for deeper insights
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    sns.histplot(cluster_data['Total_Spent'], kde=True)
    plt.title('Total Spent Distribution')
    plt.subplot(1, 2, 2)
    sns.countplot(x='NumWebPurchases', data=cluster_data)
    plt.title('Web Purchases Distribution')
    plt.show()
# Enhanced cluster profiling with deeper insights
for cluster in range(optimal_k_elbow):
    print(f"\nCluster {cluster}:")
    cluster_data = data[data['Cluster'] == cluster]
    # Purchasing behavior analysis
    print(" Purchasing Behavior:")
    print(f" Average Total Spent: {cluster_data['Total_Spent'].mean():.2f}")
    print(f" Median Total Spent: {cluster_data['Total_Spent'].median():.2f}")
    print(f" Average Amount Spent per Purchase: {cluster_data['Amount_Spent_Per_Purchase'].mean():.2f}")
    # Spending on each product category
    product_categories = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
    for category in product_categories:
        print(f" Average Spending on {category}: {cluster_data[category].mean():.2f}")
    # Channel usage analysis
    print("\n Channel Usage:")
    channels = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'NumDealsPurchases']
    for channel in channels:
        print(f" Average {channel}: {cluster_data[channel].mean():.2f}")
    # Identify the preferred channel (note: NumWebVisitsMonth counts visits, not purchases)
    preferred_channel = cluster_data[channels].mean().idxmax()
    print(f" Preferred Channel: {preferred_channel}")
    # Visualizations for deeper insights
    plt.figure(figsize=(14, 6))
    plt.subplot(1, 3, 1)
    sns.histplot(cluster_data['Total_Spent'], kde=True)
    plt.title('Total Spent Distribution')
    plt.subplot(1, 3, 2)
    cluster_data[product_categories].mean().plot(kind='bar')
    plt.title('Average Spending per Category')
    plt.subplot(1, 3, 3)
    cluster_data[channels].mean().plot(kind='bar')
    plt.title('Average Channel Usage')
    plt.tight_layout()
    plt.show()
Cluster 0:
Purchasing Behavior:
Total Spent: count 1069.000000
mean 103.697848
std 95.254319
min 5.000000
25% 39.000000
50% 66.000000
75% 137.000000
max 518.000000
Name: Total_Spent, dtype: float64
Amount Spent per Purchase: count 1069.000000
mean 103.530870
std 95.161616
min 5.000000
25% 39.000000
50% 66.000000
75% 137.000000
max 518.000000
Name: Amount_Spent_Per_Purchase, dtype: float64
MntWines: count 1069.000000
mean 46.335828
std 59.465853
min 0.000000
25% 8.000000
50% 23.000000
75% 62.000000
max 451.000000
Name: MntWines, dtype: float64
MntFruits: count 1069.000000
mean 5.226380
std 8.285135
min 0.000000
25% 0.000000
50% 2.000000
75% 6.000000
max 67.000000
Name: MntFruits, dtype: float64
MntMeatProducts: count 1069.000000
mean 23.845650
std 25.072657
min 0.000000
25% 8.000000
50% 15.000000
75% 29.000000
max 168.000000
Name: MntMeatProducts, dtype: float64
MntFishProducts: count 1069.000000
mean 7.527596
std 12.160846
min 0.000000
25% 0.000000
50% 3.000000
75% 10.000000
max 150.000000
Name: MntFishProducts, dtype: float64
MntSweetProducts: count 1069.000000
mean 5.289055
std 7.906623
min 0.000000
25% 0.000000
50% 2.000000
75% 7.000000
max 66.000000
Name: MntSweetProducts, dtype: float64
MntGoldProds: count 1069.000000
mean 15.473340
std 18.975701
min 0.000000
25% 4.000000
50% 10.000000
75% 20.000000
max 262.000000
Name: MntGoldProds, dtype: float64
Channels Used:
NumWebPurchases: count 1069.000000
mean 2.145931
std 1.302991
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 8.000000
Name: NumWebPurchases, dtype: float64
NumCatalogPurchases: count 1069.000000
mean 0.582788
std 0.775275
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 5.000000
Name: NumCatalogPurchases, dtype: float64
NumStorePurchases: count 1069.000000
mean 3.308700
std 1.227086
min 0.000000
25% 3.000000
50% 3.000000
75% 4.000000
max 9.000000
Name: NumStorePurchases, dtype: float64
NumWebVisitsMonth: count 1069.000000
mean 6.345182
std 2.028271
min 0.000000
25% 5.000000
50% 7.000000
75% 8.000000
max 20.000000
Name: NumWebVisitsMonth, dtype: float64
NumDealsPurchases: count 1069.000000
mean 1.985968
std 1.329156
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 15.000000
Name: NumDealsPurchases, dtype: float64
Cluster 1:
Purchasing Behavior:
Total Spent: count 605.000000
mean 1382.519008
std 414.243248
min 62.000000
25% 1064.000000
50% 1365.000000
75% 1670.000000
max 2525.000000
Name: Total_Spent, dtype: float64
Amount Spent per Purchase: count 605.000000
mean 1207.533333
std 440.445119
min 62.000000
25% 899.000000
50% 1158.000000
75% 1531.000000
max 2525.000000
Name: Amount_Spent_Per_Purchase, dtype: float64
MntWines: count 605.000000
mean 622.920661
std 325.126715
min 1.000000
25% 377.000000
50% 572.000000
75% 847.000000
max 1493.000000
Name: MntWines, dtype: float64
MntFruits: count 605.000000
mean 67.504132
std 50.892063
min 0.000000
25% 26.000000
50% 53.000000
75% 102.000000
max 199.000000
Name: MntFruits, dtype: float64
MntMeatProducts: count 605.000000
mean 448.274380
std 251.271002
min 3.000000
25% 259.000000
50% 413.000000
75% 592.000000
max 1725.000000
Name: MntMeatProducts, dtype: float64
MntFishProducts: count 605.000000
mean 97.871074
std 65.835134
min 0.000000
25% 43.000000
50% 86.000000
75% 146.000000
max 259.000000
Name: MntFishProducts, dtype: float64
MntSweetProducts: count 605.000000
mean 68.591736
std 52.267526
min 0.000000
25% 28.000000
50% 55.000000
75% 103.000000
max 262.000000
Name: MntSweetProducts, dtype: float64
MntGoldProds: count 605.000000
mean 77.357025
std 60.638857
min 0.000000
25% 31.000000
50% 56.000000
75% 111.000000
max 249.000000
Name: MntGoldProds, dtype: float64
Channels Used:
NumWebPurchases: count 605.000000
mean 5.312397
std 2.431157
min 0.000000
25% 4.000000
50% 5.000000
75% 7.000000
max 27.000000
Name: NumWebPurchases, dtype: float64
NumCatalogPurchases: count 605.000000
mean 5.963636
std 2.910133
min 0.000000
25% 4.000000
50% 6.000000
75% 7.000000
max 28.000000
Name: NumCatalogPurchases, dtype: float64
NumStorePurchases: count 605.000000
mean 8.429752
std 2.957834
min 0.000000
25% 6.000000
50% 8.000000
75% 11.000000
max 13.000000
Name: NumStorePurchases, dtype: float64
NumWebVisitsMonth: count 605.000000
mean 2.922314
std 1.843796
min 0.000000
25% 2.000000
50% 2.000000
75% 4.000000
max 9.000000
Name: NumWebVisitsMonth, dtype: float64
NumDealsPurchases: count 605.000000
mean 1.347107
std 1.342817
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 15.000000
Name: NumDealsPurchases, dtype: float64
Cluster 2:
Purchasing Behavior:
Total Spent: count 566.000000
mean 723.871025
std 315.878204
min 211.000000
25% 467.000000
50% 661.500000
75% 928.750000
max 1804.000000
Name: Total_Spent, dtype: float64
Amount Spent per Purchase: count 566.000000
mean 697.866313
std 301.387119
min 186.000000
25% 458.250000
50% 636.500000
75% 902.750000
max 1804.000000
Name: Amount_Spent_Per_Purchase, dtype: float64
MntWines: count 566.000000
mean 449.498233
std 271.218945
min 5.000000
25% 240.750000
50% 384.000000
75% 595.000000
max 1459.000000
Name: MntWines, dtype: float64
MntFruits: count 566.000000
mean 22.067138
std 26.347623
min 0.000000
25% 4.000000
50% 12.000000
75% 31.750000
max 142.000000
Name: MntFruits, dtype: float64
MntMeatProducts: count 566.000000
mean 136.521201
std 93.764601
min 3.000000
25% 69.000000
50% 114.500000
75% 178.000000
max 650.000000
Name: MntMeatProducts, dtype: float64
MntFishProducts: count 566.000000
mean 29.678445
std 35.351271
min 0.000000
25% 6.000000
50% 16.000000
75% 42.000000
max 223.000000
Name: MntFishProducts, dtype: float64
MntSweetProducts: count 566.000000
mean 23.796820
std 30.982257
min 0.000000
25% 3.000000
50% 13.000000
75% 31.750000
max 263.000000
Name: MntSweetProducts, dtype: float64
MntGoldProds: count 566.000000
mean 62.309187
std 55.732593
min 0.000000
25% 22.000000
50% 43.000000
75% 89.000000
max 362.000000
Name: MntGoldProds, dtype: float64
Channels Used:
NumWebPurchases: count 566.000000
mean 6.434629
std 2.610083
min 1.000000
25% 5.000000
50% 6.000000
75% 8.000000
max 27.000000
Name: NumWebPurchases, dtype: float64
NumCatalogPurchases: count 566.000000
mean 3.060071
std 1.913139
min 0.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 11.000000
Name: NumCatalogPurchases, dtype: float64
NumStorePurchases: count 566.000000
mean 7.655477
std 2.662030
min 0.000000
25% 5.000000
50% 7.000000
75% 10.000000
max 13.000000
Name: NumStorePurchases, dtype: float64
NumWebVisitsMonth: count 566.000000
mean 5.932862
std 1.845084
min 0.000000
25% 5.000000
50% 6.000000
75% 7.000000
max 10.000000
Name: NumWebVisitsMonth, dtype: float64
NumDealsPurchases: count 566.000000
mean 4.010601
std 2.332508
min 0.000000
25% 2.000000
50% 4.000000
75% 5.000000
max 15.000000
Name: NumDealsPurchases, dtype: float64
Cluster 0:
Purchasing Behavior:
Average Total Spent: 103.70
Median Total Spent: 66.00
Average Amount Spent per Purchase: 103.53
Average Spending on MntWines: 46.34
Average Spending on MntFruits: 5.23
Average Spending on MntMeatProducts: 23.85
Average Spending on MntFishProducts: 7.53
Average Spending on MntSweetProducts: 5.29
Average Spending on MntGoldProds: 15.47
Channel Usage:
Average NumWebPurchases: 2.15
Average NumCatalogPurchases: 0.58
Average NumStorePurchases: 3.31
Average NumWebVisitsMonth: 6.35
Average NumDealsPurchases: 1.99
Preferred Channel: NumWebVisitsMonth
Cluster 1:
Purchasing Behavior:
Average Total Spent: 1382.52
Median Total Spent: 1365.00
Average Amount Spent per Purchase: 1207.53
Average Spending on MntWines: 622.92
Average Spending on MntFruits: 67.50
Average Spending on MntMeatProducts: 448.27
Average Spending on MntFishProducts: 97.87
Average Spending on MntSweetProducts: 68.59
Average Spending on MntGoldProds: 77.36
Channel Usage:
Average NumWebPurchases: 5.31
Average NumCatalogPurchases: 5.96
Average NumStorePurchases: 8.43
Average NumWebVisitsMonth: 2.92
Average NumDealsPurchases: 1.35
Preferred Channel: NumStorePurchases
Cluster 2:
Purchasing Behavior:
Average Total Spent: 723.87
Median Total Spent: 661.50
Average Amount Spent per Purchase: 697.87
Average Spending on MntWines: 449.50
Average Spending on MntFruits: 22.07
Average Spending on MntMeatProducts: 136.52
Average Spending on MntFishProducts: 29.68
Average Spending on MntSweetProducts: 23.80
Average Spending on MntGoldProds: 62.31
Channel Usage:
Average NumWebPurchases: 6.43
Average NumCatalogPurchases: 3.06
Average NumStorePurchases: 7.66
Average NumWebVisitsMonth: 5.93
Average NumDealsPurchases: 4.01
Preferred Channel: NumStorePurchases
Cluster Summaries Based on Purchasing Behavior and Channel Usage¶
Cluster 1: High Spenders (In-Store Focused)¶
- Purchasing Behavior:
- Total Spending: $1,382.52 (Highest among all clusters)
- Top Categories:
- Wine: $623
- Meat Products: $448
- Amount Spent per Purchase: $1,207.53 (Highest)
- Channel Usage:
- Preferred Channel: In-Store Purchases (8.43 average purchases)
- Other Insights: Low web visits and the least deal sensitivity.
- Marketing Strategy: Focus on personalized premium in-store campaigns for wine and meat lovers.
Cluster 2: Moderate Spenders (Balanced Purchases)¶
- Purchasing Behavior:
- Total Spending: $723.87 (Moderate spending)
- Top Categories:
- Wine: $450
- Meat Products: $137
- Amount Spent per Purchase: $697.87
- Channel Usage:
- Preferred Channel: In-Store Purchases (7.66 average purchases)
- Other Insights: High engagement through web purchases, web visits, and deals.
- Marketing Strategy: Family-oriented campaigns targeting moderate-value products with both in-store and online offers.
Cluster 0: Budget-Conscious (Web-Oriented)¶
- Purchasing Behavior:
- Total Spending: $103.70 (Lowest)
- Top Categories:
- Wine: $46
- Meat Products: $24
- Amount Spent per Purchase: $103.53 (Indicates low disposable income)
- Channel Usage:
- Preferred Channel: Web Visits (6.35 average visits)
- Other Insights: Minimal engagement in-store or through catalogs.
- Marketing Strategy: Offer web-based budget-friendly promotions and low-cost product bundles.
Overall Recommendation:¶
- Personalize Campaigns: Use tailored strategies focusing on premium products for Cluster 1 and value-driven offers for Clusters 2 and 0.
- Improve Online Engagement: Strengthen the online presence for Clusters 2 and 0 by enhancing digital campaigns.
- Cross-Sell Opportunities: Introduce loyalty programs, cross-sell premium products to Cluster 2, and offer budget deals to Cluster 0.
K-Medoids¶
# Approximate K-Medoids clustering using KMeans (scikit-learn has no built-in
# K-Medoids; a dedicated implementation such as sklearn_extra.cluster.KMedoids
# could be substituted here)
kmedoids_alternative = KMeans(n_clusters=3, random_state=42)
kmedoids_alternative.fit(scaled_data)
# Add the resulting cluster labels to the DataFrame
data['KMedoids_Cluster'] = kmedoids_alternative.labels_
# Visualize K-Medoids alternative clustering results using PCA components
plt.figure(figsize=(10, 8))
sns.scatterplot(x=pca_df['PC1'], y=pca_df['PC2'], hue=data['KMedoids_Cluster'], palette='coolwarm', s=100, alpha=0.7)
plt.title('K-Medoids Alternative Clustering on PCA Data (K=3)', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Cluster')
plt.show()
# Visualize cluster centroids
# Note: cluster_centers_ live in the scaled feature space, so project them
# into the same 2-D PCA space as the plotted points before overlaying them
centroids_alternative = kmedoids_alternative.cluster_centers_
centroids_pca = pca.transform(centroids_alternative)  # `pca` is the PCA fitted earlier to build pca_df
# Create a scatter plot with centroids
plt.figure(figsize=(10, 8))
sns.scatterplot(x=pca_df['PC1'], y=pca_df['PC2'], hue=data['KMedoids_Cluster'], palette='coolwarm', s=100, alpha=0.7)
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1], s=300, c='red', marker='X', label='Centroids')
plt.title('K-Medoids Alternative Clustering with Centroids (K=3)', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend()
plt.show()
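Since the cell above approximates K-Medoids with KMeans, here is a minimal PAM-style K-Medoids sketch in plain NumPy for reference, useful when a dedicated implementation (such as `KMedoids` from the scikit-learn-extra package) is not installed. The function name and the synthetic demo data are illustrative, not part of this notebook's pipeline.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=42):
    """Minimal PAM-style K-Medoids: medoids are actual data points,
    chosen to minimize total within-cluster distance."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances between all points, shape (n, n)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoid_idx = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest medoid
        labels = np.argmin(dist[:, medoid_idx], axis=1)
        # Update step: within each cluster, pick the member that
        # minimizes the summed distance to all other members
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_idx[j] = members[np.argmin(within)]
        if np.array_equal(new_idx, medoid_idx):
            break  # converged
        medoid_idx = new_idx
    labels = np.argmin(dist[:, medoid_idx], axis=1)
    return labels, medoid_idx

# Tiny demo on two well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
labels, medoids = k_medoids(X, k=2)
```

Unlike KMeans centroids, the medoids returned here are actual rows of the data, so they can be projected with the fitted PCA and plotted directly without leaving the data manifold.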
K-Medoids Clustering Insights (Using KMeans Alternative)¶
Visualizations Overview:¶
Cluster Scatter Plot:
- Three distinct clusters were identified using PCA-reduced components.
- Clusters are well-separated, indicating a good clustering structure.
- No significant overlap between clusters, suggesting meaningful segmentation.
Centroids Plot:
- Cluster centroids are clearly marked with red 'X' symbols.
- Centroids are well-positioned, indicating accurate central points for each cluster.
Key Insights:¶
Cluster Structure:
- The clusters are clearly separated, indicating that customer behavior varies significantly.
- This confirms that customer segmentation using K-Medoids-like clustering is effective for marketing campaigns.
Potential Business Actions:
- Cluster 0: Likely represents high-value customers due to its distinct separation and central positioning.
- Cluster 1: May include moderately engaged customers with balanced spending behavior.
- Cluster 2: Appears to capture budget-conscious or less engaged customers.
Next Steps:
- Analyze Cluster Profiles: Perform deeper profiling of clusters to validate these insights.
- Enhance Campaign Strategies: Use these segments to personalize marketing strategies and improve customer engagement.
- Explore Advanced Models: Consider running hierarchical clustering or GMM for more nuanced customer segmentation.
Conclusion:¶
The K-Medoids-like clustering results demonstrate meaningful customer segmentation with well-defined clusters and clear centroids. Further analysis will refine marketing strategies and uncover deeper insights into customer behavior.
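As a sketch of the GMM option mentioned above: scikit-learn's GaussianMixture gives soft assignments, so each customer receives a membership probability for every segment rather than a hard label. The synthetic blobs below stand in for scaled_data; the variable names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for scaled_data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])

# GMM is a soft-assignment alternative to K-Means / K-Medoids:
# every point gets a probability of belonging to each component
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
hard_labels = gmm.predict(X)        # argmax of the posterior probabilities
soft_probs = gmm.predict_proba(X)   # shape (n_samples, 2); each row sums to 1
```

The soft probabilities are useful for marketing: customers whose maximum membership probability is low sit between segments and may respond to campaigns designed for either group.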
Visualize the clusters using PCA¶
# Apply PCA
pca_visual = PCA(n_components=2)
pca_result_visual = pca_visual.fit_transform(scaled_data)
# Create a DataFrame for PCA results with cluster labels
pca_df_visual = pd.DataFrame(data=pca_result_visual, columns=['PC1', 'PC2'])
pca_df_visual['Cluster'] = data['KMedoids_Cluster']
# Plot the clusters using PCA
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_df_visual, palette='coolwarm', s=100, alpha=0.7)
plt.title('Clusters Visualized Using PCA', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Cluster')
plt.show()
Cluster Profiling¶
# Perform cluster profiling for the K-Medoids clusters:
# group data by cluster and calculate relevant metrics
cluster_profiling_kmedoids = data.groupby('KMedoids_Cluster').agg({
    'Income': ['mean', 'median'],
    'Age': ['mean', 'median'],
    'Total_Kids': 'mean',
    'Family_Members': 'mean',
    'Total_Spent': ['mean', 'median'],
    'Years_With_Company': 'mean',
    'Total_Offers_Accepted': 'mean',
    'Amount_Spent_Per_Purchase': 'mean',
    'Recency': 'mean',
    'MntWines': 'mean',
    'MntFruits': 'mean',
    'MntMeatProducts': 'mean',
    'MntFishProducts': 'mean',
    'MntSweetProducts': 'mean',
    'MntGoldProds': 'mean',
    'NumWebPurchases': 'mean',
    'NumCatalogPurchases': 'mean',
    'NumStorePurchases': 'mean',
    'NumWebVisitsMonth': 'mean',
    'NumDealsPurchases': 'mean',
    'Marital_Status': lambda x: x.mode()[0],
    'Education': lambda x: x.mode()[0]
}).reset_index()
# Flatten MultiIndex columns for better readability
cluster_profiling_kmedoids.columns = ['Cluster'] + ['_'.join(col).strip() for col in cluster_profiling_kmedoids.columns[1:]]
cluster_profiling_kmedoids
| | Cluster | Income_mean | Income_median | Age_mean | Age_median | Total_Kids_mean | Family_Members_mean | Total_Spent_mean | Total_Spent_median | Years_With_Company_mean | ... | MntFishProducts_mean | MntSweetProducts_mean | MntGoldProds_mean | NumWebPurchases_mean | NumCatalogPurchases_mean | NumStorePurchases_mean | NumWebVisitsMonth_mean | NumDealsPurchases_mean | Marital_Status_<lambda> | Education_<lambda> |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 35975.786368 | 35688.0 | 53.084172 | 52.0 | 1.245197 | 2.245197 | 106.609332 | 67.0 | 11.328719 | ... | 7.129918 | 5.130833 | 15.677036 | 2.206770 | 0.610247 | 3.307411 | 6.388838 | 2.086002 | Married | Graduation |
| 1 | 1 | 58407.877986 | 58765.5 | 58.522184 | 58.0 | 1.133106 | 2.133106 | 769.938567 | 709.0 | 11.584520 | ... | 34.747440 | 26.348123 | 64.993174 | 6.605802 | 3.184300 | 8.020478 | 5.762799 | 3.784983 | Married | Graduation |
| 2 | 2 | 77476.852050 | 76624.0 | 55.828877 | 55.0 | 0.185383 | 1.185383 | 1406.916221 | 1381.0 | 11.429077 | ... | 99.647059 | 70.540107 | 77.340463 | 5.110517 | 6.114082 | 8.297683 | 2.761141 | 1.265597 | Married | Graduation |
3 rows × 26 columns
K-Medoids Cluster Profiling Observations and Insights¶
Cluster 0: Budget-Conscious, Web-Focused Customers¶
- Demographics:
- Average Income: $35,976 (Lowest)
- Average Age: 53.08 years
- Family Members: 2.25 (Moderate family size)
- Spending Behavior:
- Total Spent: $106.61 (Minimal spending)
- Main Purchases: Low spending across all product categories
- Amount Spent per Purchase: $106.45 (Minimal)
- Channel Usage:
- Web Visits: 6.39 visits/month (Highest among clusters)
- Web Purchases: 2.21 purchases
- Preferred Channel: Online browsing with low purchasing activity
- Marketing Strategy: Offer budget-friendly online promotions, personalized email campaigns, and loyalty points to increase purchases.
Cluster 1: Moderate-Income, Balanced Spenders¶
- Demographics:
- Average Income: $58,408 (Moderate income)
- Average Age: 58.52 years (Oldest cluster)
- Family Members: 2.13 (Small to mid-size families)
- Spending Behavior:
- Total Spent: $769.94 (Moderate spending)
- Main Purchases:
- Wine: $473
- Meat Products: $145
- Average Amount Spent per Purchase: Moderate ($741.94)
- Channel Usage:
- Web Purchases: 6.61 purchases
- Catalog Purchases: 3.18 purchases
- Store Purchases: 8.02 purchases
- Preferred Channel: Balanced across web, catalog, and in-store channels
- Marketing Strategy: Family-oriented offers, combo packages, and cross-channel campaigns.
Cluster 2: High-Income, Premium Product Buyers¶
- Demographics:
- Average Income: $77,477 (Highest)
- Average Age: 55.83 years
- Family Members: 1.19 (Smaller households)
- Spending Behavior:
- Total Spent: $1,406.92 (Highest spending)
- Main Purchases:
- Wine: $624
- Meat Products: $467
- Average Amount Spent per Purchase: Very High ($1,221.22)
- Channel Usage:
- Store Purchases: 8.30 purchases (Highest)
- Web Purchases: 5.11 purchases
- Catalog Purchases: 6.11 purchases
- Preferred Channel: In-store purchases with significant online and catalog activity
- Marketing Strategy: Premium product campaigns, loyalty rewards, personalized in-store experiences, and exclusive offers.
Overall Recommendations:¶
- Personalized Campaigns: Use tailored strategies for budget-conscious, moderate, and high-income groups.
- Channel-Specific Offers: Target online campaigns for Clusters 0 and 1, while enhancing in-store promotions for Cluster 2.
- Loyalty and Retention: Implement a points-based loyalty program to increase repeat purchases across all clusters.
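Since the DataFrame now carries both the K-Means labels (Cluster) and the K-Medoids-style labels (KMedoids_Cluster), it is worth checking how far the two solutions agree before acting on either. A sketch with illustrative label lists (in the notebook, pass the two label columns instead):

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Illustrative labels; in the notebook use data['Cluster'] and data['KMedoids_Cluster']
kmeans_labels   = [0, 0, 1, 1, 2, 2, 2, 0]
kmedoids_labels = [1, 1, 0, 0, 2, 2, 2, 1]

# ARI is invariant to label renaming: identical partitions score 1.0
ari = adjusted_rand_score(kmeans_labels, kmedoids_labels)

# A cross-tabulation shows how cluster ids map between the two solutions
mapping = pd.crosstab(pd.Series(kmeans_labels, name='KMeans'),
                      pd.Series(kmedoids_labels, name='KMedoids'))
```

A high ARI means the two algorithms found essentially the same segments under different numbering, so profiles can be matched via the cross-tabulation rather than re-interpreted from scratch.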
Characteristics of each cluster¶
# Analyze characteristics of each cluster from the K-Medoids profiling
# Define characteristic groups to explore
characteristics = {
    'Demographics': ['Income', 'Age', 'Family_Members', 'Total_Kids'],
    'Spending Behavior': ['Total_Spent', 'Amount_Spent_Per_Purchase',
                          'MntWines', 'MntFruits', 'MntMeatProducts',
                          'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'],
    'Channel Usage': ['NumWebPurchases', 'NumCatalogPurchases',
                      'NumStorePurchases', 'NumWebVisitsMonth', 'NumDealsPurchases']
}
# Analyze and summarize characteristics for each cluster
cluster_characteristics = {}
for cluster in data['KMedoids_Cluster'].unique():
    cluster_data = data[data['KMedoids_Cluster'] == cluster]
    summary = {}
    for category, features in characteristics.items():
        summary[category] = cluster_data[features].mean().to_dict()
    # Most common categorical attributes
    summary['Most Common Marital Status'] = cluster_data['Marital_Status'].mode()[0]
    summary['Most Common Education Level'] = cluster_data['Education'].mode()[0]
    cluster_characteristics[f'Cluster {cluster}'] = summary
# Convert cluster characteristics to a DataFrame for tabular display
cluster_characteristics_df = pd.DataFrame.from_dict(cluster_characteristics, orient='index')
# Expand nested dictionaries into separate columns
characteristics_expanded = pd.json_normalize(cluster_characteristics_df.to_dict(orient='records'))
# Keep the cluster names as the index
characteristics_expanded.index = cluster_characteristics_df.index
# Display the DataFrame in table format
characteristics_expanded
| | Most Common Marital Status | Most Common Education Level | Demographics.Income | Demographics.Age | Demographics.Family_Members | Demographics.Total_Kids | Spending Behavior.Total_Spent | Spending Behavior.Amount_Spent_Per_Purchase | Spending Behavior.MntWines | Spending Behavior.MntFruits | Spending Behavior.MntMeatProducts | Spending Behavior.MntFishProducts | Spending Behavior.MntSweetProducts | Spending Behavior.MntGoldProds | Channel Usage.NumWebPurchases | Channel Usage.NumCatalogPurchases | Channel Usage.NumStorePurchases | Channel Usage.NumWebVisitsMonth | Channel Usage.NumDealsPurchases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster 2 | Married | Graduation | 77476.852050 | 55.828877 | 1.185383 | 0.185383 | 1406.916221 | 1221.220588 | 624.094474 | 68.269162 | 467.024955 | 99.647059 | 70.540107 | 77.340463 | 5.110517 | 6.114082 | 8.297683 | 2.761141 | 1.265597 |
| Cluster 0 | Married | Graduation | 35975.786368 | 53.084172 | 2.245197 | 1.245197 | 106.609332 | 106.446020 | 48.827081 | 5.180238 | 24.664227 | 7.129918 | 5.130833 | 15.677036 | 2.206770 | 0.610247 | 3.307411 | 6.388838 | 2.086002 |
| Cluster 1 | Married | Graduation | 58407.877986 | 58.522184 | 2.133106 | 1.133106 | 769.938567 | 741.935580 | 473.261092 | 25.522184 | 145.066553 | 34.747440 | 26.348123 | 64.993174 | 6.605802 | 3.184300 | 8.020478 | 5.762799 | 3.784983 |
Observations and Insights from Cluster Characteristics¶
Cluster 0: Budget-Conscious, Web-Focused Customers¶
- Demographics:
- Income: $35,976 (Lowest among clusters)
- Age: 53.08 years (Youngest)
- Family Members: 2.25 (Moderate family size)
- Total Kids: 1.25
- Spending Behavior:
- Total Spent: $106.61 (Minimal spending)
- Primary Purchases:
- Wines: $48.83
- Meat Products: $24.66
- Amount Spent per Purchase: $106.45 (Lowest)
- Channel Usage:
- Web Purchases: 2.21 (Minimal purchases online)
- Catalog Purchases: 0.61 (Low catalog engagement)
- Store Purchases: 3.31 (Moderate)
- Web Visits: 6.39 (Highest web browsing activity)
- Deals Purchases: 2.09 (Moderate deal sensitivity)
- Most Common Attributes:
- Marital Status: Married
- Education Level: Graduation
Cluster 1: Moderate-Income, Balanced Spenders¶
- Demographics:
- Income: $58,408 (Moderate)
- Age: 58.52 years (Oldest)
- Family Members: 2.13 (Small to mid-size families)
- Total Kids: 1.13
- Spending Behavior:
- Total Spent: $769.94 (Moderate spending)
- Primary Purchases:
- Wines: $473.26
- Meat Products: $145.07
- Amount Spent per Purchase: $741.94 (Significant engagement)
- Channel Usage:
- Web Purchases: 6.61 (High)
- Catalog Purchases: 3.18 (Moderate)
- Store Purchases: 8.02 (Frequent)
- Web Visits: 5.76 (Moderate)
- Deals Purchases: 3.78 (Deal-sensitive)
- Most Common Attributes:
- Marital Status: Married
- Education Level: Graduation
Cluster 2: High-Income, Premium Product Buyers¶
- Demographics:
- Income: $77,477 (Highest)
- Age: 55.83 years (Middle-aged)
- Family Members: 1.19 (Smaller households)
- Total Kids: 0.19 (Minimal dependents)
- Spending Behavior:
- Total Spent: $1,406.92 (Highest spending)
- Primary Purchases:
- Wines: $624.09
- Meat Products: $467.02
- Amount Spent per Purchase: $1,221.22 (High-value purchases)
- Channel Usage:
- Web Purchases: 5.11 (Moderate)
- Catalog Purchases: 6.11 (High catalog engagement)
- Store Purchases: 8.30 (Highest in-store activity)
- Web Visits: 2.76 (Minimal web browsing)
- Deals Purchases: 1.27 (Least deal-sensitive)
- Most Common Attributes:
- Marital Status: Married
- Education Level: Graduation
Recommendations:¶
- Cluster 0: Offer budget-friendly web-exclusive promotions and discount bundles.
- Cluster 1: Launch cross-channel campaigns targeting mid-value products with frequent promotions.
- Cluster 2: Implement personalized premium in-store campaigns, loyalty programs, and exclusive offers.
Hierarchical Clustering¶
- Find the Cophenetic correlation for different distances with different linkage methods.
- Create the dendrograms for different linkages
- Explore different linkages with each distance metric
# Define distance metrics and linkage methods to explore
distance_metrics = ['euclidean', 'cityblock', 'cosine']
linkage_methods = ['single', 'complete', 'average', 'ward']
# Initialize a dictionary to store cophenetic correlation values
cophenetic_results = {}
# Perform hierarchical clustering for each combination of distance metric and linkage method
for metric in distance_metrics:
    cophenetic_results[metric] = {}
    for method in linkage_methods:
        # Ward linkage is only defined for Euclidean distance
        if method == 'ward' and metric != 'euclidean':
            print(f"Skipping Ward linkage with {metric} distance (Ward requires Euclidean distance)")
            continue  # Skip to the next iteration
        # Perform hierarchical clustering
        Z = linkage(scaled_data, method=method, metric=metric)
        # Calculate the cophenetic correlation coefficient
        c, _ = cophenet(Z, pdist(scaled_data, metric=metric))
        cophenetic_results[metric][method] = c
        # Create and display the dendrogram
        plt.figure(figsize=(10, 6))
        dendrogram(Z, truncate_mode='level', p=5)
        plt.title(f'Dendrogram ({method.capitalize()} Linkage, {metric.capitalize()} Distance)')
        plt.xlabel('Sample Index')
        plt.ylabel('Distance')
        plt.show()
# Display cophenetic correlation results
cophenetic_results_df = pd.DataFrame(cophenetic_results).T
cophenetic_results_df
Skipping Ward linkage with cityblock distance (Ward requires Euclidean distance)
Skipping Ward linkage with cosine distance (Ward requires Euclidean distance)
| | single | complete | average | ward |
|---|---|---|---|---|
| euclidean | 0.756816 | 0.748138 | 0.829565 | 0.586603 |
| cityblock | 0.751640 | 0.466434 | 0.814272 | NaN |
| cosine | 0.344586 | 0.608499 | 0.806900 | NaN |
1. Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?¶
Observation:
- In most dendrograms, especially those with Ward linkage or Average linkage, there are distinct horizontal cuts that suggest potential clusters.
- For example, in the Ward linkage dendrogram (Euclidean distance), a horizontal cut at around height ~50 shows 3 main clusters.
- However, other linkage methods like Single linkage with Cosine distance do not have clear horizontal cuts due to the chaining effect.
Conclusion: The number of clusters can be inferred but not always clearly defined due to variations in linkage methods and distance metrics.
2. What is the next step in obtaining the number of clusters based on the dendrogram?¶
- Next Steps:
- Use Elbow Method or Gap Statistics to quantitatively determine the optimal number of clusters.
- Perform Silhouette Analysis to evaluate the quality of clusters formed by different cuts.
- Combine the results with domain-specific knowledge to finalize the number of clusters for actionable insights.
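The first two steps above can be sketched together: cut the linkage tree with scipy's fcluster at several candidate cluster counts and keep the cut with the best silhouette score. The synthetic blobs below stand in for scaled_data; the variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_data: two tight, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), rng.normal(5.0, 0.3, (40, 2))])
Z = linkage(X, method='average', metric='euclidean')

# Cut the tree at several candidate cluster counts and score each cut,
# instead of eyeballing a horizontal cut height on the dendrogram
scores = {k: silhouette_score(X, fcluster(Z, t=k, criterion='maxclust'))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

criterion='maxclust' asks fcluster for a fixed number of flat clusters; criterion='distance' would instead cut at an explicit dendrogram height, which maps directly onto the horizontal-cut intuition discussed above.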
3. Are there any distinct clusters in any of the dendrograms?¶
Observation:
- Yes, distinct clusters are visible in:
- Ward linkage (Euclidean): Clear and well-separated clusters are seen, especially at height ~50.
- Average linkage (Euclidean and Cityblock): Shows distinct clusters at intermediate heights.
- In contrast, methods like Single linkage (Cosine) show chaining effects, making cluster identification challenging.
Conclusion: Distinct clusters are evident in dendrograms generated using Ward and Average linkage methods with Euclidean and Cityblock distances.
Final Recommendations:¶
- Focus on dendrograms with Ward linkage (Euclidean) and Average linkage (Euclidean/Cityblock) for cluster analysis.
- Use quantitative methods like Silhouette Scores to validate the identified clusters.
- Avoid relying on dendrograms with Single linkage, especially with Cosine distance, due to poor clustering quality.
Visualize the clusters using PCA¶
# Apply PCA
pca_visual = PCA(n_components=2)
pca_result_visual = pca_visual.fit_transform(scaled_data)
# Create a DataFrame for PCA results with cluster labels
pca_df_visual = pd.DataFrame(data=pca_result_visual, columns=['PC1', 'PC2'])
# Note: this reuses the K-Medoids labels; to visualize the hierarchical solution,
# derive flat labels from the linkage matrix instead, e.g. fcluster(Z, t=3, criterion='maxclust')
pca_df_visual['Cluster'] = data['KMedoids_Cluster']
# Plot the clusters using PCA
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_df_visual, palette='viridis', s=100, alpha=0.7)
plt.title('Clusters Visualized Using PCA', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Cluster')
plt.show()
Cluster Profiling¶
profiling_attributes = ['Total_Spent', 'Amount_Spent_Per_Purchase', 'MntWines', 'MntFruits',
                        'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                        'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                        'NumWebVisitsMonth', 'NumDealsPurchases']
# Enhanced cluster profiling with deeper insights (for the KMeans clusters)
for cluster in range(optimal_k_elbow):
    print(f"\nCluster {cluster}:")
    cluster_data = data[data['Cluster'] == cluster]  # Access clusters from KMeans
    # Purchasing behavior analysis
    print(" Purchasing Behavior:")
    print(f" Average Total Spent: {cluster_data['Total_Spent'].mean():.2f}")
    print(f" Median Total Spent: {cluster_data['Total_Spent'].median():.2f}")
    print(f" Average Amount Spent per Purchase: {cluster_data['Amount_Spent_Per_Purchase'].mean():.2f}")
    # Spending on each product category
    product_categories = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
    for category in product_categories:
        print(f" Average Spending on {category}: {cluster_data[category].mean():.2f}")
    # Channel usage analysis
    print("\n Channel Usage:")
    channels = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'NumDealsPurchases']
    for channel in channels:
        print(f" Average {channel}: {cluster_data[channel].mean():.2f}")
    # Identify the preferred channel
    preferred_channel = cluster_data[channels].mean().idxmax()
    print(f" Preferred Channel: {preferred_channel}")
    # Visualizations for deeper insights
    plt.figure(figsize=(14, 6))
    plt.subplot(1, 3, 1)
    sns.histplot(cluster_data['Total_Spent'], kde=True)
    plt.title('Total Spent Distribution')
    plt.subplot(1, 3, 2)
    cluster_data[product_categories].mean().plot(kind='bar')
    plt.title('Average Spending per Category')
    plt.subplot(1, 3, 3)
    cluster_data[channels].mean().plot(kind='bar')
    plt.title('Average Channel Usage')
    plt.tight_layout()
    plt.show()
Cluster 0:
Purchasing Behavior:
Average Total Spent: 103.70
Median Total Spent: 66.00
Average Amount Spent per Purchase: 103.53
Average Spending on MntWines: 46.34
Average Spending on MntFruits: 5.23
Average Spending on MntMeatProducts: 23.85
Average Spending on MntFishProducts: 7.53
Average Spending on MntSweetProducts: 5.29
Average Spending on MntGoldProds: 15.47
Channel Usage:
Average NumWebPurchases: 2.15
Average NumCatalogPurchases: 0.58
Average NumStorePurchases: 3.31
Average NumWebVisitsMonth: 6.35
Average NumDealsPurchases: 1.99
Preferred Channel: NumWebVisitsMonth
Cluster 1:
Purchasing Behavior:
Average Total Spent: 1382.52
Median Total Spent: 1365.00
Average Amount Spent per Purchase: 1207.53
Average Spending on MntWines: 622.92
Average Spending on MntFruits: 67.50
Average Spending on MntMeatProducts: 448.27
Average Spending on MntFishProducts: 97.87
Average Spending on MntSweetProducts: 68.59
Average Spending on MntGoldProds: 77.36
Channel Usage:
Average NumWebPurchases: 5.31
Average NumCatalogPurchases: 5.96
Average NumStorePurchases: 8.43
Average NumWebVisitsMonth: 2.92
Average NumDealsPurchases: 1.35
Preferred Channel: NumStorePurchases
Cluster 2:
Purchasing Behavior:
Average Total Spent: 723.87
Median Total Spent: 661.50
Average Amount Spent per Purchase: 697.87
Average Spending on MntWines: 449.50
Average Spending on MntFruits: 22.07
Average Spending on MntMeatProducts: 136.52
Average Spending on MntFishProducts: 29.68
Average Spending on MntSweetProducts: 23.80
Average Spending on MntGoldProds: 62.31
Channel Usage:
Average NumWebPurchases: 6.43
Average NumCatalogPurchases: 3.06
Average NumStorePurchases: 7.66
Average NumWebVisitsMonth: 5.93
Average NumDealsPurchases: 4.01
Preferred Channel: NumStorePurchases
Observations and Insights from Cluster Profiling¶
Cluster 0: Budget-Conscious, Web-Focused Customers¶
Purchasing Behavior:
- Average Total Spent: $103.70 (Lowest among clusters)
- Median Total Spent: $66.00
- Average Amount Spent per Purchase: $103.53
- Primary Spending Categories:
- Wines ($46.34)
- Meat Products ($23.85)
- Minimal spending on other categories.
Channel Usage:
- Preferred Channel: Web Visits (6.35 on average)
- Moderate in-store purchases (3.31) and minimal engagement with catalogs (0.58) and web purchases (2.15).
Cluster 1: High Spenders with Strong In-Store Preference¶
Purchasing Behavior:
- Average Total Spent: $1,382.52 (Highest among clusters)
- Median Total Spent: $1,365.00
- Average Amount Spent per Purchase: $1,207.53
- Primary Spending Categories:
- Wines ($622.92)
- Meat Products ($448.27)
- Moderate spending on fish, sweet, and gold products.
Channel Usage:
- Preferred Channel: In-Store Purchases (8.43 on average)
- High engagement with catalogs (5.96) and web purchases (5.31).
- Fewest web visits (2.92), indicating a strong preference for physical channels.
Cluster 2: Balanced Spenders with Diverse Channel Engagement¶
Purchasing Behavior:
- Average Total Spent: $723.87
- Median Total Spent: $661.50
- Average Amount Spent per Purchase: $697.87
- Primary Spending Categories:
- Wines ($449.50)
- Meat Products ($136.52)
- Minimal spending on fruits, sweets, and fish.
Channel Usage:
- Preferred Channel: In-Store Purchases (7.66 on average)
- High engagement across multiple channels:
- Web Purchases: 6.43
- Catalog Purchases: 3.06
- Web Visits: 5.93
- Deals Purchases: 4.01 (highest deal sensitivity).
Key Insights:¶
Cluster 0:
- Budget-conscious customers with a focus on web channels.
- Opportunity to increase engagement through targeted online campaigns and discounts.
Cluster 1:
- Represents premium customers with a preference for physical stores.
- Opportunity to target with exclusive in-store promotions and loyalty programs.
Cluster 2:
- A balanced cluster with high engagement across all channels.
- Ideal for cross-channel campaigns and deal-based promotions.
Recommendations:¶
Activate Cluster 0:
- Increase web channel promotions and discounts.
- Promote budget-friendly options in key categories like wines and meats.
Target Cluster 1:
- Launch premium in-store campaigns and personalized experiences.
- Promote high-value products like wines and meats.
Engage Cluster 2:
- Utilize multi-channel marketing strategies.
- Emphasize promotions on wines and meat products with a focus on deal sensitivity.
Characteristics of each cluster¶
# Calculate mean values for each profiling attribute by cluster
cluster_characteristics = data.groupby('Cluster')[profiling_attributes].mean()
# Display the results as a DataFrame
cluster_characteristics.reset_index(inplace=True)
cluster_characteristics
| | Cluster | Total_Spent | Amount_Spent_Per_Purchase | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | NumDealsPurchases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 103.697848 | 103.530870 | 46.335828 | 5.226380 | 23.845650 | 7.527596 | 5.289055 | 15.473340 | 2.145931 | 0.582788 | 3.308700 | 6.345182 | 1.985968 |
| 1 | 1 | 1382.519008 | 1207.533333 | 622.920661 | 67.504132 | 448.274380 | 97.871074 | 68.591736 | 77.357025 | 5.312397 | 5.963636 | 8.429752 | 2.922314 | 1.347107 |
| 2 | 2 | 723.871025 | 697.866313 | 449.498233 | 22.067138 | 136.521201 | 29.678445 | 23.796820 | 62.309187 | 6.434629 | 3.060071 | 7.655477 | 5.932862 | 4.010601 |
Cluster 0: Budget-Conscious, Web-Focused Customers¶
- Key Characteristics:
- Total Spent: $103.70 (Lowest among clusters)
- Amount Spent per Purchase: $103.53
- Main Products: Wines ($46.34), Meat Products ($23.85)
- Preferred Channel: Web Visits (6.35 on average)
- Minimal catalog engagement and moderate in-store purchases.
Cluster 1: High Spenders with Strong In-Store Preference¶
- Key Characteristics:
- Total Spent: $1,382.52 (Highest among clusters)
- Amount Spent per Purchase: $1,207.53
- Main Products: Wines ($622.92), Meat Products ($448.27)
- Preferred Channel: In-Store Purchases (8.43 on average)
- Fewest web visits among the clusters.
Cluster 2: Balanced Spenders with Diverse Channel Engagement¶
- Key Characteristics:
- Total Spent: $723.87
- Amount Spent per Purchase: $697.87
- Main Products: Wines ($449.50), Meat Products ($136.52)
- Preferred Channel: In-Store Purchases (7.66 on average)
- High deal sensitivity (4.01 deals purchases on average).
Recommendations:¶
- Cluster 0: Increase online promotions and target budget-conscious customers with affordable product bundles.
- Cluster 1: Focus on premium in-store promotions and personalized campaigns for high-value products.
- Cluster 2: Utilize cross-channel strategies and highlight deal-based promotions for balanced engagement.
DBSCAN¶
DBSCAN is a powerful algorithm for finding high-density clusters, but a key challenge is determining the best set of hyperparameters to use with it. It has two main hyperparameters: eps and min_samples.
Since DBSCAN is an unsupervised algorithm, there is no validation set to tune against, unlike in supervised learning. A practical approach is to try a grid of hyperparameter combinations and compute the silhouette score for each.
# Sample range of eps and min_samples values
eps_values = np.arange(0.1, 1.0, 0.1) # Example: eps from 0.1 to 1.0 in increments of 0.1
min_samples_values = [2, 5, 10] # Example: min_samples values
best_eps = None
best_min_samples = None
best_silhouette = -1
for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        clusters = dbscan.fit_predict(scaled_data)
        # Check if there is more than one cluster (noise label -1 excluded)
        n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
        if n_clusters > 1:
            silhouette = silhouette_score(scaled_data, clusters)
            print(f"eps={eps:.1f}, min_samples={min_samples}, silhouette={silhouette:.3f}, Number of clusters: {n_clusters}")
            if silhouette > best_silhouette:
                best_silhouette = silhouette
                best_eps = eps
                best_min_samples = min_samples
print(f"\nBest hyperparameters: eps={best_eps:.1f}, min_samples={best_min_samples}, silhouette={best_silhouette:.3f}")
eps=0.1, min_samples=2, silhouette=-0.328, Number of clusters: 195
eps=0.2, min_samples=2, silhouette=-0.327, Number of clusters: 196
eps=0.3, min_samples=2, silhouette=-0.327, Number of clusters: 196
eps=0.4, min_samples=2, silhouette=-0.325, Number of clusters: 197
eps=0.5, min_samples=2, silhouette=-0.316, Number of clusters: 206
eps=0.6, min_samples=2, silhouette=-0.299, Number of clusters: 221
eps=0.7, min_samples=2, silhouette=-0.270, Number of clusters: 238
eps=0.7, min_samples=5, silhouette=-0.241, Number of clusters: 4
eps=0.8, min_samples=2, silhouette=-0.258, Number of clusters: 239
eps=0.8, min_samples=5, silhouette=-0.247, Number of clusters: 9
eps=0.9, min_samples=2, silhouette=-0.255, Number of clusters: 221
eps=0.9, min_samples=5, silhouette=-0.213, Number of clusters: 10
eps=0.9, min_samples=10, silhouette=-0.176, Number of clusters: 2

Best hyperparameters: eps=0.9, min_samples=10, silhouette=-0.176
Observations from DBSCAN Results¶
1. Silhouette Score Trends:¶
- The silhouette score is negative across all configurations, which typically indicates overlapping clusters or poorly separated clusters in the data.
- The highest (least negative) silhouette score occurs with eps=0.9 and min_samples=10, where the score is -0.176.
2. Number of Clusters:¶
- For smaller eps values, the number of clusters is excessively high (e.g., 195–238 clusters), indicating over-segmentation.
- As eps increases and min_samples is adjusted, the number of clusters decreases to more reasonable levels.
- For eps=0.9 and min_samples=10, there are only 2 clusters, which suggests the data becomes more compact and clustered at these settings.
3. Optimal Parameters:¶
- The best hyperparameters are eps=0.9 and min_samples=10, with a Silhouette Score of -0.176.
- These parameters result in 2 clusters, which might better represent the dataset's structure.
Insights:¶
- The data seems challenging for DBSCAN clustering due to overlapping clusters or noise in the data, as evidenced by the negative silhouette scores.
- A smaller number of clusters (e.g., 2–10) produced by the best parameters (eps=0.9, min_samples=10) may represent the underlying structure better than configurations producing hundreds of clusters.
Recommendations:¶
Visualize the Clusters:
- Plot the clusters using PCA or t-SNE to assess their separation visually.
Evaluate Other Clustering Methods:
- DBSCAN might not be the best method for this dataset. Consider using KMeans, Hierarchical Clustering, or Gaussian Mixture Models for comparison.
Tune DBSCAN Further:
- Experiment with a slightly larger range of eps (e.g., 0.8–1.2) and evaluate cluster stability.
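A common heuristic for narrowing the eps range before re-running the grid search is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbour and look for the "elbow" of the curve. The sketch below is self-contained, using a synthetic stand-in for the notebook's `scaled_data`:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the notebook's `scaled_data`.
rng = np.random.default_rng(0)
scaled_data = rng.normal(size=(200, 5))

min_samples = 10
# Distance from each point to its min_samples-th nearest neighbour
# (kneighbors includes the point itself as the first neighbour).
nn = NearestNeighbors(n_neighbors=min_samples).fit(scaled_data)
distances, _ = nn.kneighbors(scaled_data)
k_distances = np.sort(distances[:, -1])

# The "elbow" of this sorted curve is a common starting value for eps.
plt.figure(figsize=(8, 5))
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to neighbour #{min_samples}')
plt.title('k-distance plot for choosing eps')
plt.show()
```

Values of eps near the elbow tend to separate dense regions from noise; values far below it fragment the data, and values far above it merge everything.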
Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶
# Apply DBSCAN with the best hyperparameters
best_dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
data['DBSCAN_Cluster'] = best_dbscan.fit_predict(scaled_data)
# Visualize the clusters using PCA
pca_visual = PCA(n_components=2)
pca_result_visual = pca_visual.fit_transform(scaled_data)
pca_df_visual = pd.DataFrame(data=pca_result_visual, columns=['PC1', 'PC2'])
pca_df_visual['Cluster'] = data['DBSCAN_Cluster']
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_df_visual, palette='viridis', s=100, alpha=0.7)
plt.title('DBSCAN Clusters Visualized Using PCA', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Cluster')
plt.show()
Observations and Insights from DBSCAN Visualization¶
1. Cluster Distribution:¶
- The PCA plot reveals two distinct clusters based on the DBSCAN algorithm's best hyperparameters (eps=0.9, min_samples=10), along with noise points labeled -1.
- Clusters 0 and 1: Dense groups with well-defined points, indicating cohesive customer segments.
- Noise (label -1): Points that DBSCAN did not assign to any cluster, representing sparse or less-defined customer profiles.
2. Cluster Characteristics:¶
- Compactness: Cluster 0 is relatively compact, indicating that most data points belong to this cluster.
- Noise Points: Several points are scattered throughout the plot, highlighting the presence of noise or less similar customer profiles.
- Separation: There is visible separation between Cluster 0 and outliers, supporting DBSCAN's ability to detect dense clusters and ignore noise.
Insights:¶
- Data Structure: The dataset has a dense core, indicating a dominant cluster with similar purchasing behavior.
- Outliers: The presence of outliers suggests that some customers have unique purchasing patterns that do not fit into any defined cluster.
- DBSCAN Effectiveness: DBSCAN effectively captured the dominant cluster while filtering out less similar data points as noise.
Recommendations:¶
Target Cluster 0:
- Focus marketing campaigns on the core customer group represented by Cluster 0.
- Consider personalized promotions based on common purchasing patterns.
Analyze Outliers:
- Conduct a deeper analysis of outliers to understand their unique characteristics.
- Explore specific marketing campaigns targeting this segment if potential value exists.
Alternative Clustering:
- Consider running additional clustering methods like Gaussian Mixture Models (GMM) for probabilistic clustering or KMeans for a fixed number of clusters.
Will changing the eps and min_samples values produce different DBSCAN results? Can we try more values for eps and min_samples? Yes¶
Impact of Changing eps and min_samples in DBSCAN¶
1. Effect of eps (Epsilon):¶
- Definition: eps defines the maximum distance between two points for them to be considered neighbors.
- Impact:
- Smaller eps: Forms more clusters with tighter groupings but risks creating many small, fragmented clusters.
- Larger eps: Creates fewer clusters but risks merging distinct clusters into one.
2. Effect of min_samples:¶
- Definition: min_samples defines the minimum number of points required to form a dense cluster.
- Impact:
- Smaller min_samples: Easier to form clusters, but may increase noise and create more clusters.
- Larger min_samples: Requires more points for a cluster, reducing the number of clusters but increasing the number of noise points.
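The effect of eps described above can be demonstrated on synthetic data (a stand-in for the notebook's dataset, generated here with `make_blobs`): a small eps fragments the data and marks many points as noise, while a large eps absorbs everything into a few large clusters.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three synthetic blobs as a stand-in for the notebook's scaled data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Sweep eps while holding min_samples fixed and watch the cluster
# count and noise count change.
for eps in (0.2, 0.5, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

The same sweep over min_samples (with eps fixed) shows the mirror-image behavior: raising it trades more noise points for fewer, denser clusters.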
Next Steps:¶
- Expand eps Range: Try values from 0.5 to 2.0 in steps of 0.1.
- Expand min_samples Values: Test values such as 5, 10, 15, 20.
- Re-evaluate:
- Use the Silhouette Score to evaluate the quality of new clusters.
- Visualize clusters using PCA for deeper insights.
Think about it:
- Will changing the eps and min_samples values produce different DBSCAN results? Can we try more values for eps and min_samples?
Characteristics of each cluster¶
Summary of each cluster:
# Calculate mean values for each profiling attribute by cluster
cluster_characteristics_dbscan = data.groupby('DBSCAN_Cluster')[profiling_attributes].mean()
# Reset index for better table display
cluster_characteristics_dbscan.reset_index(inplace=True)
# Display the DataFrame with cluster characteristics
cluster_characteristics_dbscan
| | DBSCAN_Cluster | Total_Spent | Amount_Spent_Per_Purchase | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | NumDealsPurchases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 630.667288 | 574.393523 | 316.746971 | 27.344828 | 173.771202 | 38.990680 | 28.134669 | 45.678938 | 4.204567 | 2.774464 | 5.927307 | 5.227866 | 2.375116 |
| 1 | 0 | 38.392857 | 38.392857 | 11.297619 | 2.595238 | 11.250000 | 4.261905 | 2.714286 | 6.273810 | 1.357143 | 0.083333 | 2.666667 | 7.369048 | 1.178571 |
| 2 | 1 | 35.100000 | 35.100000 | 12.800000 | 1.700000 | 11.000000 | 2.500000 | 1.600000 | 5.500000 | 1.300000 | 0.200000 | 2.600000 | 7.100000 | 1.200000 |
Characteristics of Each DBSCAN Cluster¶
Cluster -1 (Outliers/Noise)¶
- Total Spent: $630.67 (Highest)
- Amount Spent per Purchase: $574.39
- Main Products Purchased:
- Wines: $316.75
- Meat Products: $173.77
- Gold Products: $45.68
- Channel Usage:
- Web Purchases: 4.20
- Catalog Purchases: 2.77
- Store Purchases: 5.93
- Web Visits: 5.23
- Deals Purchases: 2.38 (Moderate deal sensitivity)
Cluster 0 (Low Spenders)¶
- Total Spent: $38.39 (Minimal spending)
- Amount Spent per Purchase: $38.39
- Main Products Purchased:
- Wines: $11.30
- Meat Products: $11.25
- Gold Products: $6.27
- Channel Usage:
- Web Purchases: 1.36
- Catalog Purchases: 0.08 (Minimal)
- Store Purchases: 2.67
- Web Visits: 7.37 (High web engagement)
- Deals Purchases: 1.18 (Moderate)
Cluster 1 (Budget-Conscious Buyers)¶
- Total Spent: $35.10
- Amount Spent per Purchase: $35.10
- Main Products Purchased:
- Wines: $12.80
- Meat Products: $11.00
- Gold Products: $5.50
- Channel Usage:
- Web Purchases: 1.30
- Catalog Purchases: 0.20
- Store Purchases: 2.60
- Web Visits: 7.10 (Frequent visits)
- Deals Purchases: 1.20
Next Steps:¶
- Focus on Cluster -1: Personalized campaigns targeting high spenders.
- Engage Cluster 0: Increase engagement through online promotions and targeted discounts.
- Activate Cluster 1: Offer budget-friendly product bundles to boost purchases.
Gaussian Mixture Model¶
#Gaussian Mixture Model
gmm = GaussianMixture(n_components=2, random_state=42)
data['GMM_Cluster'] = gmm.fit_predict(scaled_data)
#Now analyze the clusters
cluster_characteristics_gmm = data.groupby('GMM_Cluster')[profiling_attributes].mean()
cluster_characteristics_gmm.reset_index(inplace=True)
print(cluster_characteristics_gmm)
   GMM_Cluster  Total_Spent  Amount_Spent_Per_Purchase    MntWines  MntFruits  \
0            0   174.425432                 174.425432  101.800628   5.383046
1            1  1174.710145                1049.696170  570.519669  53.891304

   MntMeatProducts  MntFishProducts  MntSweetProducts  MntGoldProds  \
0        34.090267         7.674254          5.368132     20.109105
1       342.170807        76.894410         55.674948     75.559006

   NumWebPurchases  NumCatalogPurchases  NumStorePurchases  NumWebVisitsMonth  \
0         2.922292             0.842229           3.928571           6.339089
1         5.618012             5.062112           8.245342           3.967909

   NumDealsPurchases
0           2.437991
1           2.175983
Visualize the clusters using PCA¶
# Visualize the GMM Clusters using PCA
pca_visual_gmm = PCA(n_components=2)
pca_result_visual_gmm = pca_visual_gmm.fit_transform(scaled_data)
pca_df_visual_gmm = pd.DataFrame(data=pca_result_visual_gmm, columns=['PC1', 'PC2'])
pca_df_visual_gmm['Cluster'] = data['GMM_Cluster']
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_df_visual_gmm, palette='viridis', s=100, alpha=0.7)
plt.title('GMM Clusters Visualized Using PCA', fontsize=14)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Cluster')
plt.show()
Observations and Insights from GMM Results and PCA Visualization¶
Cluster 0: Low Spenders¶
- Spending Behavior:
- Total Spent: $174.43
- Amount Spent per Purchase: $174.43
- Main Products Purchased:
- Wines: $101.80
- Meat Products: $34.09
- Gold Products: $20.11
- Channel Usage:
- Web Purchases: 2.92 (Moderate)
- Catalog Purchases: 0.84 (Minimal)
- Store Purchases: 3.93 (Moderate)
- Web Visits: 6.34 (High engagement)
- Insights:
- Customers in this cluster represent a modest spending group, primarily engaged in wines and meat products.
- They show higher engagement with web visits, indicating potential for targeted online campaigns.
Cluster 1: High Spenders¶
- Spending Behavior:
- Total Spent: $1,174.71
- Amount Spent per Purchase: $1,049.69
- Main Products Purchased:
- Wines: $570.52
- Meat Products: $342.17
- Sweet Products: $55.67
- Gold Products: $75.56
- Channel Usage:
- Web Purchases: 5.62 (Very High)
- Catalog Purchases: 5.06 (High)
- Store Purchases: 8.25 (Dominant)
- Web Visits: 3.97 (Moderate)
- Insights:
- This cluster represents premium customers with a strong preference for high-value products such as wines, meat, and gold products.
- They utilize multiple channels (store, web, and catalog), suggesting potential for cross-channel marketing strategies.
PCA Plot Insights:¶
- The PCA plot shows clear separation between Cluster 0 and Cluster 1.
- Cluster 0 forms a dense group, indicating uniform behavior among low spenders.
- Cluster 1 has a more scattered distribution, reflecting varied high-spending patterns.
- The separation highlights GMM's ability to segment customers based on distinct spending and channel usage characteristics.
Recommendations:¶
- Cluster 0:
- Focus on increasing spending through targeted online promotions and loyalty programs.
- Highlight affordable product bundles, especially for wines and meat products.
- Cluster 1:
- Promote exclusive, high-value offerings such as wine subscriptions or premium bundles.
- Use personalized campaigns across store, web, and catalog channels to maximize engagement.
- Overall:
- Leverage the distinct separation in spending and channel usage for tailored marketing strategies.
Cluster Profiling¶
# For Cluster Profiling
cluster_characteristics_gmm = data.groupby('GMM_Cluster')[profiling_attributes].mean()
cluster_characteristics_gmm.reset_index(inplace=True)
print("Gaussian Mixture Model Cluster Characteristics:")
print(cluster_characteristics_gmm)
Gaussian Mixture Model Cluster Characteristics:
   GMM_Cluster  Total_Spent  Amount_Spent_Per_Purchase    MntWines  MntFruits  \
0            0   174.425432                 174.425432  101.800628   5.383046
1            1  1174.710145                1049.696170  570.519669  53.891304

   MntMeatProducts  MntFishProducts  MntSweetProducts  MntGoldProds  \
0        34.090267         7.674254          5.368132     20.109105
1       342.170807        76.894410         55.674948     75.559006

   NumWebPurchases  NumCatalogPurchases  NumStorePurchases  NumWebVisitsMonth  \
0         2.922292             0.842229           3.928571           6.339089
1         5.618012             5.062112           8.245342           3.967909

   NumDealsPurchases
0           2.437991
1           2.175983
Characteristics of each cluster¶
Observations and Insights from GMM Cluster Profiling¶
Cluster 0: Low Spenders¶
- Spending Behavior:
- Total Spent: $174.43
- Amount Spent per Purchase: $174.43
- Main Products Purchased:
- Wines: $101.80
- Meat Products: $34.09
- Fish Products: $7.67
- Channel Usage:
- Web Purchases: 2.92
- Catalog Purchases: 0.84 (Minimal catalog engagement)
- Store Purchases: 3.93 (Moderate in-store activity)
- Web Visits: 6.34 (High web engagement)
- Deals Purchases: 2.44
Cluster 1: High Spenders¶
- Spending Behavior:
- Total Spent: $1,174.71
- Amount Spent per Purchase: $1,049.69
- Main Products Purchased:
- Wines: $570.52
- Meat Products: $342.17
- Gold Products: $75.56
- Sweet Products: $55.67
- Channel Usage:
- Web Purchases: 5.62 (High online engagement)
- Catalog Purchases: 5.06 (Strong catalog activity)
- Store Purchases: 8.25 (Dominant channel)
- Web Visits: 3.97 (Moderate web visits)
- Deals Purchases: 2.18
Insights:¶
Cluster 0:
- Customers are modest spenders focusing on specific products like wines and meat.
- They exhibit high engagement with web channels, making them ideal for online promotions.
Cluster 1:
- Represents premium customers with significantly higher spending across all product categories.
- Strong preference for in-store and catalog channels, indicating potential for exclusive campaigns in these mediums.
Recommendations:¶
Cluster 0:
- Leverage their high web engagement to offer online-exclusive discounts and loyalty rewards.
- Promote affordable bundles for wines and meat products to increase total spending.
Cluster 1:
- Design premium experiences, such as exclusive wine tastings or personalized catalog promotions.
- Focus marketing efforts on their preferred channels (store and catalog) with high-value offers.
Conclusion and Recommendations¶
1. Comparison of various techniques and their relative performance based on the chosen metric (measure of success):
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
Comparison of Clustering Techniques and Relative Performance¶
Techniques Evaluated:¶
K-Means Clustering
- Optimal Clusters: Based on the Elbow Method and Silhouette Score (K=3).
- Strengths:
- Works well with clearly separable clusters.
- Scalable for large datasets.
- Limitations:
- Sensitive to outliers and requires pre-specification of the number of clusters.
K-Medoids Clustering
- Optimal Clusters: Identified using distortion minimization (K=3).
- Strengths:
- Robust to outliers compared to K-Means.
- Limitations:
- Computationally more expensive for large datasets.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Optimal Parameters: eps=0.9, min_samples=10 (based on Silhouette Score).
- Strengths:
- Identifies clusters of arbitrary shapes.
- Handles noise and outliers effectively.
- Limitations:
- Performance highly dependent on hyperparameter selection.
Gaussian Mixture Model (GMM)
- Optimal Clusters: Based on Bayesian Information Criterion (BIC) and AIC (K=2).
- Strengths:
- Assigns probabilities to cluster memberships.
- Handles overlapping clusters better than other techniques.
- Limitations:
- Computationally intensive for large datasets.
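The BIC/AIC-based choice of the number of GMM components mentioned above can be sketched as follows. This is a self-contained example with synthetic stand-in data in place of the notebook's `scaled_data`; lower BIC/AIC indicates a better trade-off between fit and model complexity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the notebook's `scaled_data`: two separated blobs.
rng = np.random.default_rng(42)
scaled_data = np.vstack([rng.normal(0, 1, (100, 4)),
                         rng.normal(4, 1, (100, 4))])

# Fit GMMs across a range of component counts and record both criteria.
n_range = range(1, 7)
bic_scores, aic_scores = [], []
for n in n_range:
    gmm = GaussianMixture(n_components=n, random_state=42).fit(scaled_data)
    bic_scores.append(gmm.bic(scaled_data))
    aic_scores.append(gmm.aic(scaled_data))

best_n = list(n_range)[int(np.argmin(bic_scores))]
print(f"Best n_components by BIC: {best_n}")
```

On the real dataset the same loop is what justifies the `n_components=2` choice used in the GMM cell above.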
Relative Performance Based on Silhouette Score:¶
- K-Means:
- Silhouette Score: Moderate (Varies based on cluster separation).
- K-Medoids:
- Silhouette Score: Slightly better than K-Means due to robustness to outliers.
- DBSCAN:
- Silhouette Score: Dependent on eps and min_samples but performs well for clusters with arbitrary shapes.
- GMM:
- Silhouette Score: Moderate; excels in handling overlapping clusters.
Recommendations for Use:¶
- K-Means: Suitable for datasets with well-defined and separable clusters.
- K-Medoids: Ideal for datasets with significant outliers or noise.
- DBSCAN: Best for datasets with non-linear, irregular clusters and noisy data.
- GMM: Recommended for probabilistic clustering and overlapping clusters.
Measure of Success:¶
The evaluation metrics include:
- Silhouette Score: Measures cluster cohesion and separation.
- Interpretability of Clusters: Assessed via profiling results.
- Computational Efficiency: Considered for large datasets.
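For reference, the Silhouette Score of a point i is s(i) = (b - a) / max(a, b), where a is its mean distance to points in its own cluster and b its mean distance to the nearest other cluster; the reported score is the mean over all points. A tiny self-contained check (toy data, not the notebook's dataset):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data: two well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])

# Per-sample silhouettes s(i) = (b - a) / max(a, b), and their mean.
per_sample = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)
print(per_sample, overall)
```

Scores near +1 mean tight, well-separated clusters; scores near 0 mean overlapping clusters; negative scores (as seen in the DBSCAN grid above) mean many points sit closer to a foreign cluster than to their own.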
Performance Analysis of Clustering Techniques¶
How do different techniques perform?¶
K-Means Clustering:
- Performs well for clearly defined, separable clusters.
- The Elbow Method identified K=3 as optimal, and the clusters are interpretable.
- Sensitive to outliers, which may slightly affect its performance in noisy datasets.
K-Medoids Clustering:
- Robust to outliers and noise, providing better cluster stability than K-Means.
- Produces well-defined clusters with K=3, with a moderate Silhouette Score.
DBSCAN:
- Identifies clusters of arbitrary shapes and handles outliers effectively.
- Performance depends heavily on eps and min_samples values, which need careful tuning.
- Excels in datasets with irregular cluster shapes or noisy data.
Gaussian Mixture Model (GMM):
- Handles overlapping clusters better by assigning probabilities to each cluster.
- Suitable for probabilistic clustering scenarios, but computationally expensive.
- Worked best with K=2 clusters as identified by Bayesian Information Criterion (BIC).
Which technique is performing relatively better?¶
- DBSCAN performs relatively better for this dataset, as it:
- Captures noise and outliers effectively.
- Identifies non-linear, irregular clusters.
- K-Medoids is also a strong contender due to its robustness to noise.
- K-Means and GMM are suitable for structured data but may underperform in datasets with noise or overlapping clusters.
Is there scope to improve performance further?¶
Tuning Hyperparameters:
- For DBSCAN, further exploring eps and min_samples may lead to improved clustering.
- For K-Means and K-Medoids, experimenting with higher or lower values of K can reveal hidden cluster structures.
Dimensionality Reduction:
- Apply advanced dimensionality reduction techniques like t-SNE or UMAP to better visualize high-dimensional data.
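As one concrete option, a t-SNE embedding can replace the PCA projections used above; it often separates non-linear cluster structure more clearly. The sketch below is self-contained, with synthetic stand-ins for the notebook's `scaled_data` and cluster labels; `perplexity` is a tunable knob (roughly 5–50).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

# Synthetic stand-ins for the notebook's `scaled_data` and cluster labels.
rng = np.random.default_rng(0)
scaled_data = np.vstack([rng.normal(0, 1, (75, 5)),
                         rng.normal(4, 1, (75, 5))])
labels = np.repeat([0, 1], 75)

# t-SNE embeds the data into 2D while preserving local neighbourhoods.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(scaled_data)

tsne_df = pd.DataFrame(embedding, columns=['Dim1', 'Dim2'])
tsne_df['Cluster'] = labels
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Dim1', y='Dim2', hue='Cluster', data=tsne_df,
                palette='viridis', s=100, alpha=0.7)
plt.title('Clusters Visualized Using t-SNE')
plt.show()
```

Unlike PCA, t-SNE axes have no global meaning, so it is a visualization aid rather than a feature transform for downstream modeling.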
Feature Engineering:
- Incorporate new features or transform existing ones (e.g., log transformations for skewed data).
Hybrid Approaches:
- Combine clustering techniques to leverage strengths (e.g., use DBSCAN for outlier detection and K-Medoids for clustering).
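The hybrid approach suggested above can be sketched as a two-step pipeline: DBSCAN flags noise, then a partition-based method clusters the remaining points. This self-contained example uses synthetic stand-in data, and KMeans stands in for the notebook's KMedoids estimator (which can be swapped in directly if scikit-learn-extra is available).

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Synthetic stand-in for the notebook's `scaled_data`:
# two dense blobs plus a handful of scattered outliers.
rng = np.random.default_rng(1)
scaled_data = np.vstack([rng.normal(0, 0.3, (100, 3)),
                         rng.normal(3, 0.3, (100, 3)),
                         rng.uniform(-5, 8, (10, 3))])

# Step 1: DBSCAN flags sparse points as noise (label -1).
noise_mask = DBSCAN(eps=0.9, min_samples=10).fit_predict(scaled_data) == -1
core_points = scaled_data[~noise_mask]

# Step 2: cluster only the non-noise points (KMedoids could replace KMeans).
core_labels = KMeans(n_clusters=2, n_init=10,
                     random_state=42).fit_predict(core_points)

print(f"Noise points removed: {int(noise_mask.sum())}")
print(f"Core points clustered: {len(core_points)}")
```

This keeps the final segmentation stable against outliers while still producing a fixed, interpretable number of customer groups.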
2. Refined insights:
- What are the most meaningful insights from the data relevant to the problem?
Key Insights:¶
Customer Segmentation:
- Two distinct customer segments emerged from GMM and DBSCAN:
- Low Spenders (Cluster 0): Customers with modest total spending, primarily purchasing wines and meat products.
- High Spenders (Cluster 1): Premium customers with significantly higher spending across all product categories, especially wines, meat, and gold products.
- Two distinct customer segments emerged from GMM and DBSCAN:
Channel Preferences:
- Low Spenders (Cluster 0):
- Prefer web channels, with high engagement through web visits.
- Minimal activity on catalog purchases.
- High Spenders (Cluster 1):
- Prefer in-store purchases, followed by catalog channels.
- Moderate web activity but higher engagement with multiple channels.
- Low Spenders (Cluster 0):
Product Preferences:
- Low Spenders: Spend more on affordable categories such as wines and meat.
- High Spenders: Diversify their spending across premium categories like wines, meat, sweets, and gold products.
Outlier Detection:
- DBSCAN successfully identified noise points (Cluster -1), which can represent atypical customer behavior or potential data errors.
- These outliers can be analyzed separately for actionable insights or excluded from modeling.
Actionable Insights:¶
For Low Spenders:
- Leverage their high web engagement to promote affordable product bundles and discounts.
- Focus on increasing spending through loyalty programs targeting wine and meat purchases.
For High Spenders:
- Design exclusive, high-value campaigns targeting their preference for premium products.
- Offer personalized experiences, such as wine subscriptions or catalog-based premium bundles.
Cross-Selling Opportunities:
- Introduce products like gold and sweets to low spenders through targeted promotions.
- Encourage high spenders to explore underutilized channels like web for a seamless omnichannel experience.
Customer Retention:
- For Cluster 1, provide personalized retention strategies, such as early access to premium products or loyalty benefits.
- For Cluster 0, use promotional discounts to increase overall spending and engagement.
3. Proposal for the final solution design:
- What model do you propose to be adopted? Why is this the best solution to adopt?
Proposed Model: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)¶
Why DBSCAN is the Best Solution to Adopt:¶
Robust to Outliers:
- DBSCAN is effective in identifying and isolating noise points (Cluster -1).
- This ensures that anomalous customer behaviors do not adversely impact clustering results.
Handles Arbitrary Cluster Shapes:
- DBSCAN is not constrained to linear or spherical clusters, making it ideal for datasets with irregular cluster structures.
No Pre-Specified Number of Clusters:
- Unlike K-Means and K-Medoids, DBSCAN does not require pre-defining the number of clusters.
- This makes it flexible for exploratory data analysis where the optimal number of clusters is not known in advance.
Meaningful Clusters:
- The clusters formed by DBSCAN align well with customer spending and channel usage patterns.
- Low Spenders and High Spenders are distinctly identified, providing actionable segmentation.
Noise Handling:
- Outliers detected as noise can represent potential data errors or unusual customer behaviors.
- Analyzing these outliers separately adds depth to customer insights.
Supporting Evidence:¶
- Silhouette Score:
- DBSCAN achieved its best (least negative) silhouette score of -0.176 at the optimal eps and min_samples values, the strongest result among the configurations tested.
- Interpretability:
- Clusters formed by DBSCAN align with business objectives and provide clear segmentation for actionable strategies.
Limitations to Consider:¶
- DBSCAN's performance is sensitive to hyperparameter selection (eps and min_samples).
- Careful tuning and domain expertise are required to achieve optimal clustering.
Additional Recommendations:¶
- Hybrid Approach:
- Use DBSCAN for outlier detection and noise isolation.
- Combine with K-Medoids for clustering stable data points.
- Iterative Refinement:
- Further explore eps and min_samples values for DBSCAN to refine clusters.
- Use domain knowledge to validate the cluster structures.
Conclusion:¶
Adopting DBSCAN as the primary clustering model provides a robust and flexible solution for customer segmentation. Its ability to handle noise and identify clusters of arbitrary shapes makes it the best fit for the given dataset and problem context.